
Introduction to Python is brought to you by the Centre for the Analysis of Genome Evolution & Function (CAGEF) bioinformatics training initiative. This course was developed based on feedback about the needs and interests of the Department of Cell & Systems Biology and the Department of Ecology and Evolutionary Biology.
The structure of this course is a code-along style, so it is 100% hands on! A few hours prior to each lecture, the materials will be available for download on Quercus. The teaching materials will consist of a Jupyter Lab Notebook with concepts, comments, instructions, and blank spaces that you will fill out with Python code along with the instructor. Other teaching materials include a live version of the notebook and, when required, datasets to import into Python. This learning approach will let you spend your time coding rather than taking notes!
As we go along, there will be some in-class challenge questions for you to solve either individually or in cooperation with your peers. Post lecture assessments will also be available (see syllabus for grading scheme and percentages of the final mark) through DataCamp to help cement and/or extend what you learn each week.
We'll take a blank-slate approach to Python here and assume that you know essentially nothing about programming. From the beginning of this course to the end, we want to take you from one of these potential scenarios:
You have a pile of data (like an Excel file or tab-separated file) full of experimental observations and don't know what to do with it.
Maybe you're manipulating large tables entirely in Excel, making custom formulas and pivot tables with graphs. Now you have to repeat similar experiments and do the analysis all over again.
You're generating high-throughput data and there aren't any bioinformaticians around to help you sort it out.
You've heard about Python and what it could do for your data analysis, but don't know what that means or where to start.
and get you to a point where you can:
Format your data correctly for analysis
Produce basic plots and perform exploratory analysis
Make functions and scripts for re-analysing existing or new data sets
Track your experiments in a digital notebook like Jupyter!
In the first two lessons, we will talk about the basic data structures and objects in Python, get cozy with the Jupyter Notebook environment, and learn how to get help when you are stuck. Because everyone gets stuck - a lot! Then you will learn how to get your data in and out of Python, and how to tidy, subset, and merge data (data wrangling). We'll take a break from data wrangling to spend our fourth lecture learning how to generate exploratory data plots. Then we'll harness the power of Python and programming with flow control, before visiting text manipulation techniques in lectures 5 and 6. Data cleaning and string manipulation are really the battleground of coding - getting your data into a format where you can analyse it. Lastly, we will learn to write customized functions to help scale up your analyses.

The structure of the class is a code-along style: it is fully hands on. At the end of each lecture, the complete notes will be made available as a PDF through the corresponding Quercus module so you don't have to spend your attention taking notes.
There is no single correct path from A to B, although some paths may be more elegant or more efficient than others. With that in mind, the emphasis in this lecture series will be on:
the pandas package for working with tabular data. This resource is well maintained by a large community of developers. While not always the "fastest" approach, this additional layer can help ensure your code still runs (somewhat) smoothly down the road.

Welcome to this fourth lecture in a series of seven. Today we will pick up where we left off last week with our merged data. We'll learn how to explore the data, summarize it, and plot it!
At the end of this lecture we will aim to have covered the following topics:
The matplotlib.pyplot package.
The seaborn package.

A quick legend for the formatting conventions used in these notes:

grey background - a package, function, code, command or directory. Backticks are also used for in-line code.
italics - an important term or concept or an individual file or folder
bold - heading or a term that is being defined
blue text - named or unnamed hyperlink
... - Within each coding cell this will indicate an area of code that students will need to complete for the code cell to run correctly.
Each week, new lesson files will appear within your JupyterHub folders. We are pulling from a GitHub repository using this Repository git-pull link. Simply click on the link and it will take you to the University of Toronto JupyterHub. You will need to use your UTORid credentials to complete the login process. From there you will find each week's lecture files in the directory /2023-01-IntroPython/Lecture_XX. You will find a partially coded skeleton.ipynb file as well as all of the data files necessary to run the week's lecture.
Alternatively, you can download the Jupyter Notebook (.ipynb) and data files from JupyterHub to your personal computer if you would like to run independently of the JupyterHub.
A live lecture version will be available at camok.github.io that will update as the lecture progresses. Be sure to refresh to take a look if you get lost!
As mentioned above, at the end of each lecture there will be a completed version of the lecture code released as a PDF file under the Modules section of Quercus.
The following datasets used in this week's class come from a preprint manuscript on bioRxiv entitled "High-throughput phenotyping of C. elegans wild isolates reveals specific resistance and susceptibility traits to infection by distinct microsporidia species" by Mok et al., 2022. These datasets focus on an analysis of infection in wild isolate strains of the nematode C. elegans by environmental pathogens known as microsporidia. The authors collected embryo counts from individual animals in the population after population-wide infection by microsporidia, and we'll spend our next few classes working with the dataset to learn how to format and manipulate it.
This is a result of our efforts (mostly) from last lecture. After transforming a wide-format version of our measurement data, we merged it with some metadata regarding our experiments and now it is ready to be visualized!
This is an imaging analysis of infected C. elegans strains N2 and JU1400, measuring the overall number of pixels for each animal and the number of fluorescent (infected) pixels within the same area.
IPython and InteractiveShell will be accessed just to set the behaviour we want for IPython, so we can see multiple code outputs per code cell.
random is a package with methods to add pseudorandomness to programs
numpy provides a number of mathematical functions as well as the special data class of arrays which we'll be learning about today.
os is a package for interacting with the operating system, such as working with file paths and directories.
pandas is the package we use to work with tabular data in DataFrames.
matplotlib is a plotting library; we'll use its pyplot interface to generate figures.
# ----- Always run this at the beginning of class so we can get multi-command output ----- #
# Access options from the iPython core
from IPython.core.interactiveshell import InteractiveShell
# Change the value of ast_node_interactivity
InteractiveShell.ast_node_interactivity = "all"
# ----- Additional packages we want to import for class ----- #
# Import the pandas package
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
# !pip install seaborn
Recall from last week: we spent our lecture converting data from a series of worm/pathogen interactions, containing observations about the formation of spores, meronts, and embryos within individual animals in an infected population. The process involved converting from wide format to long format, followed by merging this data with a set of metadata.
Now that the wrangling is completed, we can perform some exploratory data analysis (EDA). EDA investigates your data to identify abnormalities, summarize its main characteristics, and spot potential patterns or trends for further validation. We did some initial statistical summarization on the numerical and non-numerical data, but today we'll dig deeper using some additional tools in our Python pockets.
With our EDA today we will try to answer questions like:
Let's open a version of our dataset from last week. We'll find it in the file data/embryo_long_merged.csv. Recall that it is always good practice to explore your data to find out more about aspects such as probability distributions, outliers, and central tendency/dispersion measures.
Now we are going to do some exploratory data analysis (EDA) on embryo_long_merged.csv which we made last class.
# Read in embryo_long_merged.csv
embryo_merged = pd.read_csv("data/embryo_long_merged.csv")
# Take a peek at the data
embryo_merged.head()
# How big is this dataset?
embryo_merged.info()
| worm.number | date | wormStrain | pathogenStrain | pathogenDose | doseLevel | timepoint | merontsPresent | sporesPresent | numEmbryos | ... | Plate Size | Spores/cm2 | Temp | infection.type | Staining Date | Stain type | Slide date | Slide number | Slide Box | Imaging Date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 190426 | AB1 | LUAm1 | 0M | Mock | 72hpi | False | False | 10 | ... | 6 | 0.0 | 21 | continuous | 190430 | DY96 | 190501 | 7 | 2 | 190502 |
| 1 | 2 | 190426 | AB1 | LUAm1 | 0M | Mock | 72hpi | False | False | 9 | ... | 6 | 0.0 | 21 | continuous | 190430 | DY96 | 190501 | 7 | 2 | 190502 |
| 2 | 3 | 190426 | AB1 | LUAm1 | 0M | Mock | 72hpi | False | False | 16 | ... | 6 | 0.0 | 21 | continuous | 190430 | DY96 | 190501 | 7 | 2 | 190502 |
| 3 | 4 | 190426 | AB1 | LUAm1 | 0M | Mock | 72hpi | False | False | 13 | ... | 6 | 0.0 | 21 | continuous | 190430 | DY96 | 190501 | 7 | 2 | 190502 |
| 4 | 5 | 190426 | AB1 | LUAm1 | 0M | Mock | 72hpi | False | False | 8 | ... | 6 | 0.0 | 21 | continuous | 190430 | DY96 | 190501 | 7 | 2 | 190502 |
5 rows × 31 columns
<class 'pandas.core.frame.DataFrame'> RangeIndex: 11149 entries, 0 to 11148 Data columns (total 31 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 worm.number 11149 non-null int64 1 date 11149 non-null int64 2 wormStrain 11149 non-null object 3 pathogenStrain 11149 non-null object 4 pathogenDose 11149 non-null object 5 doseLevel 11149 non-null object 6 timepoint 11149 non-null object 7 merontsPresent 11149 non-null bool 8 sporesPresent 11149 non-null bool 9 numEmbryos 11149 non-null int64 10 experiment 11149 non-null object 11 experimenter 11149 non-null object 12 description 11149 non-null object 13 Infection Date 11149 non-null int64 14 Plate Number 11149 non-null int64 15 Total Worms 11149 non-null int64 16 Spore Lot 11149 non-null object 17 Lot concentration 11149 non-null int64 18 Total ul spore 11149 non-null float64 19 Infection Round 11149 non-null int64 20 40X OP50 (mL) 11149 non-null float64 21 Plate Size 11149 non-null int64 22 Spores/cm2 11149 non-null float64 23 Temp 11149 non-null int64 24 infection.type 11149 non-null object 25 Staining Date 11149 non-null int64 26 Stain type 11149 non-null object 27 Slide date 11149 non-null int64 28 Slide number 11149 non-null int64 29 Slide Box 11149 non-null int64 30 Imaging Date 11149 non-null int64 dtypes: bool(2), float64(3), int64(15), object(11) memory usage: 2.5+ MB
So, as a reminder, we've imported a dataset that is 11,149 rows with 31 columns of data. At this point, the only measurement data exists in 3 columns: merontsPresent, sporesPresent, and numEmbryos.
Before we go any further, however, let's drop some of the extraneous metadata that we won't need for our analyses. Pretty much everything from Infection Round onwards is unnecessary. From our call to .info() we can see that we just need the columns up to (but not including) index 19.
# Subset the data from our dataset
embryo_merged_subset = embryo_merged.iloc[:, :19]
# What does the subset look like?
embryo_merged_subset.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 11149 entries, 0 to 11148 Data columns (total 19 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 worm.number 11149 non-null int64 1 date 11149 non-null int64 2 wormStrain 11149 non-null object 3 pathogenStrain 11149 non-null object 4 pathogenDose 11149 non-null object 5 doseLevel 11149 non-null object 6 timepoint 11149 non-null object 7 merontsPresent 11149 non-null bool 8 sporesPresent 11149 non-null bool 9 numEmbryos 11149 non-null int64 10 experiment 11149 non-null object 11 experimenter 11149 non-null object 12 description 11149 non-null object 13 Infection Date 11149 non-null int64 14 Plate Number 11149 non-null int64 15 Total Worms 11149 non-null int64 16 Spore Lot 11149 non-null object 17 Lot concentration 11149 non-null int64 18 Total ul spore 11149 non-null float64 dtypes: bool(2), float64(1), int64(7), object(9) memory usage: 1.5+ MB
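As a side note, the same subset can be taken by column label instead of position: unlike .iloc, a .loc label slice includes both endpoints, so you name the last column you want to keep. A minimal sketch on a toy frame (using just a few stand-in columns from our dataset):

```python
import pandas as pd

# A toy frame standing in for embryo_merged, with only four of the real columns
df = pd.DataFrame({
    "worm.number": [1, 2],
    "Total ul spore": [0.0, 8.2],
    "Infection Round": [1, 1],
    "Imaging Date": [190502, 190502],
})

# Positional slicing: the stop index (2) is excluded
by_position = df.iloc[:, :2]

# Label slicing: .loc includes BOTH endpoints, so we name the last column to keep
by_label = df.loc[:, :"Total ul spore"]

# Both keep exactly worm.number and Total ul spore
by_position.equals(by_label)
```

Label slicing can be more readable, and it still works if an upstream edit shifts the column positions.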
Let's begin with a deceptively simple question about our data. As we'll see, however, it requires more than a single method call to answer. We'll walk through the thought process so you can avoid potential pitfalls later in your own analyses.

There is a lot of subgrouped data hidden within our dataset. Experiments are classified by their date, wormStrain, pathogenStrain, pathogenDose, and timepoint. The combination of all five of these can also be found in the experiment column, although that combination may not always be useful to us.
Let's begin with the describe() method to review any numerical data that we can.
# Get a description of the numeric data
embryo_merged_subset.describe()
| worm.number | date | numEmbryos | Infection Date | Plate Number | Total Worms | Lot concentration | Total ul spore | |
|---|---|---|---|---|---|---|---|---|
| count | 11149.000000 | 11149.00000 | 11149.000000 | 11149.000000 | 11149.000000 | 11149.000000 | 11149.00000 | 11149.000000 |
| mean | 27.499148 | 197408.14378 | 9.148444 | 197405.090143 | 18.785721 | 1309.444793 | 273701.61001 | 25.916820 |
| std | 16.460217 | 4864.97377 | 7.509851 | 4864.935158 | 15.939817 | 1678.485600 | 124153.58178 | 33.928541 |
| min | 1.000000 | 190426.00000 | 0.000000 | 190423.000000 | 1.000000 | 500.000000 | 63625.00000 | 0.000000 |
| 25% | 14.000000 | 190426.00000 | 2.000000 | 190423.000000 | 7.000000 | 1000.000000 | 176000.00000 | 0.000000 |
| 50% | 27.000000 | 200714.00000 | 9.000000 | 200711.000000 | 14.000000 | 1000.000000 | 176000.00000 | 8.196721 |
| 75% | 40.000000 | 200825.00000 | 15.000000 | 200822.000000 | 25.000000 | 1000.000000 | 427000.00000 | 56.818182 |
| max | 115.000000 | 200918.00000 | 48.000000 | 200915.000000 | 63.000000 | 10000.000000 | 427000.00000 | 113.636364 |
Using the describe() method to summarize non-numeric columns

Recall that we can also summarize our non-numeric data, to a certain extent, as long as we provide it properly to the describe() method. This method can identify the "top" occurring entry in a column as well as its frequency. We already know that the wormStrain column contains the strain information for each individual worm measured in our data. The same goes for pathogenStrain, which contains similar information about the pathogens used. It should be simple enough to create a summary of that information. Let's give it a try.
# Use the describe method on the wormStrain and pathogenStrain columns from our merged subset
embryo_merged_subset.loc[:,['wormStrain', 'pathogenStrain']].describe()
| wormStrain | pathogenStrain | |
|---|---|---|
| count | 11149 | 11149 |
| unique | 18 | 10 |
| top | N2 | LUAm1 |
| freq | 2941 | 5030 |
At a quick glance, we can answer our first question: N2 is the most-measured worm strain in our infection studies, while LUAm1 is the most-measured pathogen. There are a few catches to these results, however:

Although N2 is the most often measured animal, this is biased by the fact that it acts as a control strain in many experiments. If we were to look at specific groups of experiments and how often N2 was included, would it still be the most prevalent?

Our measurements might include entries where LUAm1 is the pathogenStrain but its pathogenDose is 0! In those cases, it isn't really being used to infect at all.

Let's address the second catch and circle back to the first in a little bit, as it is slightly more complex.
Last lecture we saw a few examples where we subset our data using a simple conditional statement like isna(). While we referred to this as "slicing" our data, you can also think of it as a way to filter data. We can, of course, filter our data using other conditional statements, and before summarizing your data, you should consider the nature of your values! In our dataset, it appears that pathogenDose denotes whether an animal was exposed to a pathogen or was "mock-infected". A mock-infected sample has a pathogenDose value of 0.
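As a minimal illustration of this kind of conditional filtering (with made-up values rather than our real dataset):

```python
import pandas as pd

doses = pd.DataFrame({
    "wormStrain": ["N2", "N2", "AB1", "AB1"],
    "pathogenDose": [0.0, 2.5, 0.0, 2.5],
})

# The comparison broadcasts across the column, producing a boolean Series
mask = doses["pathogenDose"] > 0

# .loc then uses that Series to keep only the rows where the mask is True
infected = doses.loc[mask, :]
infected.shape  # (2, 2): only the truly infected rows remain
```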
To solve our pathogen-usage conundrum, we turn to filtering our data by the pathogenDose column. We'll do this in two steps:
# Recall that we can broadcast a conditional query to multiple values in a DataFrame
embryo_merged_subset.loc[:, 'pathogenDose'] > 0
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) Input In [6], in <cell line: 2>() 1 # Recall that we can broadcast a conditional query to multiple values in a DataFrame ----> 2 embryo_merged_subset.loc[:, 'pathogenDose'] > 0 File ~\anaconda3\envs\CSBJupyter\lib\site-packages\pandas\core\ops\common.py:72, in _unpack_zerodim_and_defer.<locals>.new_method(self, other) 68 return NotImplemented 70 other = item_from_zerodim(other) ---> 72 return method(self, other) File ~\anaconda3\envs\CSBJupyter\lib\site-packages\pandas\core\arraylike.py:58, in OpsMixin.__gt__(self, other) 56 @unpack_zerodim_and_defer("__gt__") 57 def __gt__(self, other): ---> 58 return self._cmp_method(other, operator.gt) File ~\anaconda3\envs\CSBJupyter\lib\site-packages\pandas\core\series.py:6243, in Series._cmp_method(self, other, op) 6240 rvalues = extract_array(other, extract_numpy=True, extract_range=True) 6242 with np.errstate(all="ignore"): -> 6243 res_values = ops.comparison_op(lvalues, rvalues, op) 6245 return self._construct_result(res_values, name=res_name) File ~\anaconda3\envs\CSBJupyter\lib\site-packages\pandas\core\ops\array_ops.py:287, in comparison_op(left, right, op) 284 return invalid_comparison(lvalues, rvalues, op) 286 elif is_object_dtype(lvalues.dtype) or isinstance(rvalues, str): --> 287 res_values = comp_method_OBJECT_ARRAY(op, lvalues, rvalues) 289 else: 290 res_values = _na_arithmetic_op(lvalues, rvalues, op, is_cmp=True) File ~\anaconda3\envs\CSBJupyter\lib\site-packages\pandas\core\ops\array_ops.py:75, in comp_method_OBJECT_ARRAY(op, x, y) 73 result = libops.vec_compare(x.ravel(), y.ravel(), op) 74 else: ---> 75 result = libops.scalar_compare(x.ravel(), y, op) 76 return result.reshape(x.shape) File ~\anaconda3\envs\CSBJupyter\lib\site-packages\pandas\_libs\ops.pyx:107, in pandas._libs.ops.scalar_compare() TypeError: '>' not supported between instances of 'str' and 'int'
Oops! The pathogenDose column holds strings! We need to fix that first and convert values like "0M" to floats like 0.0.
Recall we have at our disposal the pop(), str.split(), astype(), and insert() methods. We'll use those now to fix the pathogenDose column and replace it in our embryo_merged_subset.
# Use the insert method
embryo_merged_subset.insert(loc = 4, # Location to insert at
column = 'pathogenDose', # The column name after insertion
# To calculate the value, we'll pop from the subset
value = (embryo_merged_subset.pop('pathogenDose')
# Break up the float from the "M"
.str.split(pat = "M", expand = True)
# Convert the first column to a float, and then provide only that to insert
.astype({0:'float64'})[0])
)
# Check on the updated subset data
embryo_merged_subset.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 11149 entries, 0 to 11148 Data columns (total 19 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 worm.number 11149 non-null int64 1 date 11149 non-null int64 2 wormStrain 11149 non-null object 3 pathogenStrain 11149 non-null object 4 pathogenDose 11149 non-null float64 5 doseLevel 11149 non-null object 6 timepoint 11149 non-null object 7 merontsPresent 11149 non-null bool 8 sporesPresent 11149 non-null bool 9 numEmbryos 11149 non-null int64 10 experiment 11149 non-null object 11 experimenter 11149 non-null object 12 description 11149 non-null object 13 Infection Date 11149 non-null int64 14 Plate Number 11149 non-null int64 15 Total Worms 11149 non-null int64 16 Spore Lot 11149 non-null object 17 Lot concentration 11149 non-null int64 18 Total ul spore 11149 non-null float64 dtypes: bool(2), float64(2), int64(7), object(8) memory usage: 1.5+ MB
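As an aside, if every entry in the column really does follow the same "<number>M" pattern (as it appears to here), a lighter-weight alternative would be to strip the trailing letter and cast the result, assigning it straight back to the column. A sketch on a stand-in Series:

```python
import pandas as pd

# Stand-in values mimicking the pathogenDose column's "<number>M" format
dose_strings = pd.Series(["0M", "2.5M", "10M"], name="pathogenDose")

# Strip the trailing "M" and convert to floats in one chain
dose_floats = dose_strings.str.rstrip("M").astype("float64")
dose_floats.tolist()  # [0.0, 2.5, 10.0]
```

The pop()/insert() route used above reaches the same result while demonstrating several useful methods at once.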
Now check on that conditional filtering again!
# Now we can use a conditional comparison on the pathogenDose column
embryo_merged_subset.loc[:, 'pathogenDose'] > 0
0 False
1 False
2 False
3 False
4 False
...
11144 True
11145 True
11146 True
11147 True
11148 True
Name: pathogenDose, Length: 11149, dtype: bool
# Supply the result of the conditional query as a filtering criteria
print("Filtered pathogenStrain summary")
embryo_merged_subset.loc[embryo_merged_subset.loc[:, 'pathogenDose'] > 0, # remember we're filtering by rows!
# We only need the pathogenStrain column to summarize
'pathogenStrain'].describe()
print("Unfiltered pathogenStrain summary")
# Compare that to the unfiltered subset
embryo_merged_subset.loc[:, 'pathogenStrain'].describe()
Filtered pathogenStrain summary
count 7730 unique 10 top LUAm1 freq 2838 Name: pathogenStrain, dtype: object
Unfiltered pathogenStrain summary
count 11149 unique 10 top LUAm1 freq 5030 Name: pathogenStrain, dtype: object
The .describe() method returns a Series object

A useful aspect of the .describe() method is that it returns a Series object, which means its information can be retrieved or saved for further use! Recall that we can choose from the possible index labels. In the case of non-numeric summaries, we can use count, unique, top, or freq. These can also be accessed by 0-indexed position.
pathogen_Summary = embryo_merged_subset.loc[embryo_merged_subset.loc[:, 'pathogenDose'] > 0,
# We only need the pathogenStrain column to summarize
'pathogenStrain'].describe()
# Of the LUAm1 infections, which worm strain reigns supreme?
embryo_merged_subset.loc[(embryo_merged_subset["pathogenStrain"] == pathogen_Summary.top), 'wormStrain'].describe()
count 5030 unique 15 top N2 freq 864 Name: wormStrain, dtype: object
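Since the summary is a Series, its entries can also be pulled out by index label, not just by attribute as above. A quick sketch with toy values:

```python
import pandas as pd

strains = pd.Series(["N2", "N2", "JU1400", "N2"], name="wormStrain")
summary = strains.describe()

# Retrieve individual summary values by label...
top_strain = summary["top"]   # 'N2'
top_count = summary["freq"]   # 3

# ...or by attribute, as long as the label doesn't clash with a Series method
summary.top == summary["top"]
```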
The filter() method

You may be wondering to yourself: surely there must be a filter() method implemented for the DataFrame. It seems like such an essential part of working with DataFrames. You're right that such a method exists, but you would be wrong to think that it is used for conditional filtering.

The filter() method is used merely for subsetting your DataFrame by column name or by regular expression pattern (see Lecture 06). While this can be helpful in certain contexts, it does NOT implement the idea of filtering rows of our data based on conditional criteria.
Here's a quick example of how to use it.
# Select the wormStrain, pathogenStrain, and numEmbryos columns
embryo_merged_subset.filter(items = ['wormStrain', 'pathogenStrain', 'numEmbryos'])
| wormStrain | pathogenStrain | numEmbryos | |
|---|---|---|---|
| 0 | AB1 | LUAm1 | 10 |
| 1 | AB1 | LUAm1 | 9 |
| 2 | AB1 | LUAm1 | 16 |
| 3 | AB1 | LUAm1 | 13 |
| 4 | AB1 | LUAm1 | 8 |
| ... | ... | ... | ... |
| 11144 | N2 | ERTm5-96H | 1 |
| 11145 | N2 | ERTm5-96H | 0 |
| 11146 | N2 | ERTm5-96H | 0 |
| 11147 | N2 | ERTm5-96H | 3 |
| 11148 | N2 | ERTm5-96H | 2 |
11149 rows × 3 columns
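Beyond an explicit items list, filter() can also match column names by substring (the like parameter) or by regular expression (the regex parameter). A minimal sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({
    "wormStrain": ["N2"],
    "pathogenStrain": ["LUAm1"],
    "numEmbryos": [10],
})

# Keep every column whose name contains the substring "Strain"
strain_cols = df.filter(like="Strain")
list(strain_cols.columns)  # ['wormStrain', 'pathogenStrain']

# Keep every column whose name starts with "num"
num_cols = df.filter(regex="^num")
list(num_cols.columns)  # ['numEmbryos']
```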
Circling back to our first question, we discovered that N2 is the worm strain that is most often measured across all of our datasets, but the dataset contains groups of different infection experiments. Is N2 the most prevalent strain when we look at its inclusion within individual experiments? How do we explore this?
The groupby() method

Thinking about our problem, we already have an identifier that breaks down each experimental grouping: date. Each distinct date essentially marks a different experimental replicate. We could use our newly taught filtering techniques, but we would have to cycle through each potential value and summarize each subset. (Honestly, this would have been my approach not so long ago!)
Luckily for you, the groupby() method can sort all of that data for you based on the criteria you provide. The important parameters to us today are:
by: A function, label, or list of labels that you want to use to determine the grouping criteria. You can even provide a dictionary object where specific key:value pairs determine groupings.
axis: Whether to split along the rows (0, the default) or columns (1).
as_index: A boolean determining whether the index labels should be based on the group labels (True by default).

# How many individual dates/replicates are there?
len(embryo_merged_subset.loc[:,'date'].unique())
11
# Group our subset data by the 'date' column
embryo_merged_subset.groupby(by = ['date'])
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001E3805FA380>
Using the head() method to view rows from each group

As you can see above, we created a DataFrameGroupBy object, but if we attempted to look at it, it would look pretty much like the original merged subset. The major difference is that the data has now essentially been sorted by the date column. To view part of it, we can use the head(n) method, which returns n rows from each group.
# Group our subset data by the 'date' column and view 1 row from each group
embryo_merged_subset.groupby(by = ['date']).head(1)
| worm.number | date | wormStrain | pathogenStrain | pathogenDose | doseLevel | timepoint | merontsPresent | sporesPresent | numEmbryos | experiment | experimenter | description | Infection Date | Plate Number | Total Worms | Spore Lot | Lot concentration | Total ul spore | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 190426 | AB1 | LUAm1 | 0.0 | Mock | 72hpi | False | False | 10 | 190426_AB1_LUAm1_0M_72hpi | CM | Wild isolate phenoMIP retest | 190423 | 7 | 1000 | 2A | 176000 | 0.0 |
| 594 | 1 | 200707 | ED3052A | LUAm1 | 0.0 | Mock | 72hpi | False | False | 12 | 200707_ED3052A_LUAm1_0M_72hpi | CM | Lua1 continuous Infection test JU1400 and ED30... | 200704 | 5 | 1000 | 2A | 176000 | 0.0 |
| 842 | 1 | 200714 | ED3052A | LUAm1 | 0.0 | Mock | 72hpi | False | False | 8 | 200714_ED3052A_LUAm1_0M_72hpi | CM | Lua1 continuous Infection test JU1400 and ED3052 | 200711 | 5 | 1000 | 2A | 176000 | 0.0 |
| 1092 | 1 | 200721 | ED3052A | LUAm1 | 0.0 | Mock | 72hpi | False | False | 12 | 200721_ED3052A_LUAm1_0M_72hpi | CM | Lua1 continuous Infection test JU1400 and ED3052 | 200718 | 5 | 1000 | 2A | 176000 | 0.0 |
| 1342 | 1 | 200821 | AWR144 | LUAm1 | 0.0 | Mock | 72hpi | False | False | 23 | 200821_AWR144_LUAm1_0M_72hpi | CM | Lua1 continuous Infection test 6X NIL, VC40171... | 200818 | 7 | 1000 | 2A | 176000 | 0.0 |
| 1692 | 1 | 200825 | AWR144 | LUAm1 | 0.0 | Mock | 72hpi | False | False | 14 | 200825_AWR144_LUAm1_0M_72hpi | CM | Lua1 continuous Infection test 6X NIL, VC40171... | 200822 | 7 | 1000 | 2A | 176000 | 0.0 |
| 1942 | 1 | 200904 | AWR144 | LUAm1 | 0.0 | Mock | 72hpi | False | False | 16 | 200904_AWR144_LUAm1_0M_72hpi | CM | NIL tests for Lua1 and ERTM5, low dose ERTM5 t... | 200901 | 7 | 1000 | 2A | 176000 | 0.0 |
| 5158 | 1 | 200915 | AWR144 | ERTm5 | 0.0 | Mock | 72hpi | False | False | 19 | 200915_AWR144_ERTm5_0M_72hpi | CM | NIL tests for ERTM5 | 200912 | 5 | 1000 | 2 | 427000 | 0.0 |
| 5358 | 1 | 200918 | AWR144 | ERTm5 | 0.0 | Mock | 72hpi | False | False | 26 | 200918_AWR144_ERTm5_0M_72hpi | CM | NIL tests for ERTM5 | 200915 | 5 | 1000 | 2 | 427000 | 0.0 |
| 10551 | 1 | 200905 | JU1400 | ERTm5-96H | 0.0 | Mock | 96hpi | False | False | 9 | 200905_JU1400_ERTm5-96H_0M_96hpi | CM | NIL tests for Lua1 and ERTM5, low dose ERTM5 t... | 200901 | 26 | 1000 | 2 | 427000 | 0.0 |
| 10649 | 1 | 200916 | JU1400 | ERTm5-96H | 0.0 | Mock | 96hpi | False | False | 15 | 200916_JU1400_ERTm5-96H_0M_96hpi | CM | NIL tests for ERTM5 | 200912 | 12 | 500 | 2 | 427000 | 0.0 |
Using the .groupby() method and head(), we can view a representative observation from each group. To answer our above question, however, we need more information. If we repeat our above code but also include wormStrain and pathogenStrain columns in our grouping, what will that produce?
# Group our subset data by the 'date' and 'wormStrain' columns and grab the first row from each
# How big is the result?
embryo_merged_subset.groupby(by = ['date', 'wormStrain', 'pathogenStrain']).head(1).shape
(99, 19)
From our results, we see there are 99 separate combinations of date:wormStrain:pathogenStrain in our grouped DataFrame! Now that we have our groups we can begin to apply functions to summarize data from each group. Back to our question, we can now ask, which wormStrain produces the largest number of date:wormStrain:pathogenStrain combinations.
We've already used some functions like unique(), but other helpful functions to apply include sum(), max(), min(), and median(). Functions like idxmin() and idxmax() will return the index of the min and max values respectively, but only for the first occurrence of the value being sought.
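Several of these summaries can also be requested in a single call with the .agg() method, which accepts a list of function names. A minimal sketch on toy grouped data:

```python
import pandas as pd

df = pd.DataFrame({
    "wormStrain": ["N2", "N2", "JU1400", "JU1400"],
    "numEmbryos": [10, 20, 5, 7],
})

# Apply several summary functions to each group at once
group_stats = df.groupby(by=["wormStrain"])["numEmbryos"].agg(["mean", "min", "max"])
group_stats.loc["N2", "mean"]  # 15.0
```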
In this case, what we really need is something that produces a frequency table. The .value_counts() method can be applied to a Series to produce such a table. We just need to provide it with the column(s) we'd like to use via the subset parameter.
In this case, we are interested in wormStrain so let's see what that looks like.
# Convert the "wormStrain" column from our grouped data to a frequency table
(embryo_merged_subset.groupby(by = ['date', 'wormStrain', 'pathogenStrain']).head(1)
# Create the frequency table
.value_counts(subset = ['wormStrain'])
)
wormStrain N2 29 JU1400 28 MY1 7 AWR145 6 AWR144 6 VC40171 3 VC20019 3 ED3052A 3 ED3052B 3 MY2 2 JU360 2 JU642 1 JU397 1 JU300 1 MY6 1 ED3042 1 CB4856 1 AB1 1 dtype: int64
So, from our groupings it looks like N2 edges out JU1400 by just a single group. That's pretty close but now we've answered our question!
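If proportions would be more informative than raw counts, value_counts() also accepts a normalize=True argument. A sketch with toy values:

```python
import pandas as pd

strains = pd.Series(["N2", "N2", "N2", "JU1400"], name="wormStrain")

# Raw frequency table (sorted by descending count)
counts = strains.value_counts()               # N2: 3, JU1400: 1

# The same table expressed as fractions of the total
props = strains.value_counts(normalize=True)  # N2: 0.75, JU1400: 0.25
```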
Let's approach this question by identifying our criteria:
# Identify the mean embryo value of uninfected strains
# Subset by uninfected animals
(embryo_merged_subset.loc[embryo_merged_subset['pathogenDose'] == 0, :]
# Group by wormStrain and take the numEmbryos column
.groupby(by = ['wormStrain'])['numEmbryos']
# Calculate the mean
.mean()
)
wormStrain AB1 10.903226 AWR144 19.228000 AWR145 20.656000 CB4856 20.750000 ED3042 13.147541 ED3052A 10.526667 ED3052B 12.047297 JU1400 11.139501 JU300 15.622951 JU360 16.788618 JU397 11.457143 JU642 14.733333 MY1 11.725581 MY2 21.886179 MY6 18.360656 N2 18.912500 VC20019 15.143678 VC40171 4.586667 Name: numEmbryos, dtype: float64
Using sort_values() to organize your data

It looks like we got the results we were looking for, but the data is sorted alphabetically by index. What if we're interested in finding the highest and lowest values in our dataset? Since the result is small, a quick and easy way to see this is with the sort_values() method, which sorts in ascending order by default.
If we had a multi-column DataFrame, we could use the by parameter to select multiple columns to sort with.
To sort in descending or reverse order, we can set the ascending = False parameter.
# Identify the mean embryo value of uninfected strains AND sort them!
# Subset by uninfected animals
(embryo_merged_subset.loc[embryo_merged_subset['pathogenDose'] == 0, :]
# Group by wormStrain and take the numEmbryos column
.groupby(by = ['wormStrain'])['numEmbryos']
# Calculate the mean
.mean()
# Sort the data by descending order
.sort_values(ascending = False)
)
wormStrain
MY2        21.886179
CB4856     20.750000
AWR145     20.656000
AWR144     19.228000
N2         18.912500
MY6        18.360656
JU360      16.788618
JU300      15.622951
VC20019    15.143678
JU642      14.733333
ED3042     13.147541
ED3052B    12.047297
MY1        11.725581
JU397      11.457143
JU1400     11.139501
AB1        10.903226
ED3052A    10.526667
VC40171     4.586667
Name: numEmbryos, dtype: float64
We now have an answer to our question and have identified the mean number of embryos per uninfected animal in each strain. This gets us a baseline value for each strain that can be used in later comparisons!
# Identify the mean embryo value of uninfected strains AND sort them!
# Subset by uninfected animals
(embryo_merged_subset.loc[embryo_merged_subset['pathogenDose'] == 0, :]
# Group by wormStrain and take the numEmbryos colums
.groupby(by = ['wormStrain'])['numEmbryos']
# Calculate the mean
.mean()
# How do we calculate the median?
...
)
Now that we've got a few very helpful tools under our belt, we can take our query to the next level and ask what the mean and standard deviation of any individual worm strain is across multiple replicates.
Time to think about your dataset in relation to your question. We know already that each worm strain may appear within any infection experiment, but a dose of 0 represents uninfected animals. Again, it will be important to filter our data before summarizing it. Then you must determine what groupings you are looking for and what measurement you'd like to summarize. Let's break down the problem:
- Group by date, wormStrain, and pathogenStrain

# Determine the standard deviation of the mean embryo counts across each strain
# in the uninfected state - i.e. a baseline embryo count.
# Filter for only uninfected data
(embryo_merged_subset.loc[embryo_merged_subset['pathogenDose'] == 0, :]
# Group by infection experiment
.groupby(by = ['date', 'wormStrain', 'pathogenStrain'])
# Select the numEmbryos column in each group
['numEmbryos']
.mean()
# Group the series of means again by wormStrain
.groupby(by=['wormStrain'])
# Summarize each group of numEmbryo means
.describe()
# Sort the data by descending order
.sort_values(by = 'mean', ascending = False)
)
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| wormStrain | ||||||||
| MY2 | 2.0 | 21.892460 | 0.364216 | 21.634921 | 21.763690 | 21.892460 | 22.021230 | 22.150000 |
| CB4856 | 1.0 | 20.750000 | NaN | 20.750000 | 20.750000 | 20.750000 | 20.750000 | 20.750000 |
| AWR145 | 5.0 | 20.656000 | 2.160435 | 17.180000 | 20.520000 | 20.860000 | 21.760000 | 22.960000 |
| AWR144 | 5.0 | 19.228000 | 1.990407 | 16.940000 | 17.400000 | 20.060000 | 20.100000 | 21.640000 |
| N2 | 13.0 | 18.795973 | 2.816125 | 15.240000 | 16.300000 | 19.358209 | 20.880000 | 24.541667 |
| MY6 | 1.0 | 18.360656 | NaN | 18.360656 | 18.360656 | 18.360656 | 18.360656 | 18.360656 |
| JU360 | 2.0 | 16.799841 | 1.952303 | 15.419355 | 16.109598 | 16.799841 | 17.490085 | 18.180328 |
| JU300 | 1.0 | 15.622951 | NaN | 15.622951 | 15.622951 | 15.622951 | 15.622951 | 15.622951 |
| VC20019 | 3.0 | 15.181159 | 0.616995 | 14.468750 | 15.000000 | 15.531250 | 15.537364 | 15.543478 |
| JU642 | 1.0 | 14.733333 | NaN | 14.733333 | 14.733333 | 14.733333 | 14.733333 | 14.733333 |
| ED3042 | 1.0 | 13.147541 | NaN | 13.147541 | 13.147541 | 13.147541 | 13.147541 | 13.147541 |
| ED3052B | 3.0 | 12.020833 | 2.987960 | 10.062500 | 10.301250 | 10.540000 | 13.000000 | 15.460000 |
| MY1 | 4.0 | 11.611538 | 1.578067 | 9.520000 | 10.960000 | 11.840000 | 12.491538 | 13.246154 |
| JU397 | 1.0 | 11.457143 | NaN | 11.457143 | 11.457143 | 11.457143 | 11.457143 | 11.457143 |
| JU1400 | 12.0 | 11.217214 | 3.025072 | 5.724638 | 9.470000 | 11.720000 | 13.585000 | 14.660000 |
| AB1 | 1.0 | 10.903226 | NaN | 10.903226 | 10.903226 | 10.903226 | 10.903226 | 10.903226 |
| ED3052A | 3.0 | 10.526667 | 0.761665 | 9.700000 | 10.190000 | 10.680000 | 10.940000 | 11.200000 |
| VC40171 | 3.0 | 4.586667 | 1.571284 | 2.980000 | 3.820000 | 4.660000 | 5.390000 | 6.120000 |
Notice the presence of NaN values in our standard deviation columns? Can you tell why this is the case? What is the relationship between all of the strains with such a value?
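One way to convince yourself: pandas' std() computes the sample standard deviation (ddof = 1), which divides by n - 1, so any group with a single observation produces NaN. A minimal sketch with invented numbers:

```python
import pandas as pd

# Two groups: one with a single value, one with two values
vals = pd.Series([20.75, 21.6, 22.2],
                 index=["CB4856", "MY2", "MY2"])

# Sample std divides by (n - 1); for the single-entry group
# that divisor is 0, so pandas returns NaN
stds = vals.groupby(level=0).std()
print(stds)
```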
Circling back towards our initial questions, one in a similar vein would be to see how many pathogens have been tested on each strain. Let's remember to plan our analysis:
- pathogenDose is > 0

Let's start with the first 3 criteria:
# Filter for only infected data
(embryo_merged_subset.loc[embryo_merged_subset['pathogenDose'] > 0, :]
# Group our data by worm strain
.groupby('wormStrain')
# Isolate pathogen strain information in each group
['pathogenStrain']
# Determine the unique values in each
.unique()
)
wormStrain
AB1                                                  [LUAm1]
AWR144                                        [LUAm1, ERTm5]
AWR145                                        [LUAm1, ERTm5]
CB4856                                               [ERTm5]
ED3042                                               [LUAm1]
ED3052A                                              [LUAm1]
ED3052B                                              [LUAm1]
JU1400     [LUAm1, ERTm5, AWRm78, LUAm3, MAM1, LUAm1-HK, ...
JU300                                                [ERTm5]
JU360                                         [LUAm1, ERTm2]
JU397                                                [LUAm1]
JU642                                                [LUAm1]
MY1                                           [LUAm1, ERTm5]
MY2                                           [ERTm5, ERTm2]
MY6                                                  [LUAm1]
N2         [LUAm1, ERTm5, ERTm2, AWRm78, LUAm3, MAM1, LUA...
VC20019                                [LUAm1, ERTm5, ERTm2]
VC40171                                              [LUAm1]
Name: pathogenStrain, dtype: object
You can see from our above results that we have generated a Series object, where the pathogen strains associated with each worm strain are stored as an np.array object. We know that we can extract the .size property from those objects so that should get us our answer!
# Filter for only infected data
(embryo_merged_subset.loc[embryo_merged_subset['pathogenDose'] > 0, :]
# Group our data by worm strain
.groupby('wormStrain')
# Isolate pathogen strain information in each group
['pathogenStrain']
# Determine the unique values in each
.unique()
# Extract the size of each array
.size
)
18
Uh oh, just a single number - 18. That's actually how many worm strains we had in the Series object generated by the call to .unique(). What we wanted, instead, was the size of each element in the Series. How do we obtain that from what we have so far?
apply() method to broadcast a function to individual elements

We haven't spent much time discussing this with DataFrames, but you may recall that np.array objects can perform element-wise arithmetic. The general term for this ability is broadcasting. In the case of DataFrames and Series, we can perform similar element-wise operations.
We can:

- apply() our own custom functions to elements

The apply() method takes the form apply(func, args=(), **kwargs). We'll talk a little more about some of these parameters in later lectures, but for now we have:

- func: the name of the function you want to use
- args: a tuple of any additional arguments that are needed for func to work

In our case, we want to apply the .size attribute to each array in our Series. You'll see one more unfamiliar piece of code: lambda. This is how we tell Python we want to make a quick function on the spot. We'll cover this in a later lecture as well, so for now we'll just roll with it.
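Before applying this to our real data, here is a standalone sketch of apply() with a lambda, using a tiny hand-made Series of arrays that mimics what .unique() returns:

```python
import numpy as np
import pandas as pd

# A Series whose *elements* are arrays, like the output of .unique()
s = pd.Series({
    "AB1": np.array(["LUAm1"]),
    "MY1": np.array(["LUAm1", "ERTm5"]),
})

# .size on the Series counts the Series' own elements
series_size = s.size  # the number of worm strains, not the array sizes

# apply() calls the lambda once per element, returning each array's size
array_sizes = s.apply(lambda x: x.size)
print(array_sizes)
```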
# Filter for only infected data
(embryo_merged_subset.loc[embryo_merged_subset['pathogenDose'] > 0, :]
# Group our data by worm strain
.groupby('wormStrain')
# Isolate pathogen strain information in each group
['pathogenStrain']
# Determine the unique values in each
.unique()
# Extract the size of each array
.apply(lambda x: x.size)
)
wormStrain
AB1         1
AWR144      2
AWR145      2
CB4856      1
ED3042      1
ED3052A     1
ED3052B     1
JU1400      9
JU300       1
JU360       2
JU397       1
JU642       1
MY1         2
MY2         2
MY6         1
N2         10
VC20019     3
VC40171     1
Name: pathogenStrain, dtype: int64
nunique() on a grouped dataframe to return the number of unique elements

Looking at our code above, we went through 5 steps to get a final answer:

1. Filter for the infected observations
2. Group the data by wormStrain
3. Isolate the pathogenStrain column
4. Determine the unique values in each group
5. Extract the size of each array with apply()
What if, instead, we used a helpful method - nunique() - to count the number of unique elements in our grouped dataframe? This method simplifies the process by combining the unique() and len() steps (which we have used previously to achieve the same goal). Furthermore, it ignores NA values by default. Using this method simplifies our process into just 3 steps of code, as we'll see.
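A quick sketch contrasting the manual len(unique()) approach with nunique() on a toy Series (note that nunique() skips missing values by default, so we drop them first in the manual version):

```python
import pandas as pd

# Toy Series with a duplicate and a missing value
s = pd.Series(["LUAm1", "ERTm5", "LUAm1", None])

# Manual two-step approach: drop NAs, find uniques, count them
manual = len(s.dropna().unique())

# nunique() does the same in one call (dropna=True by default)
auto = s.nunique()
print(manual, auto)
```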
# Filter for only infected data
(embryo_merged_subset.loc[embryo_merged_subset['pathogenDose'] > 0, :]
# Group our data by worm strain
.groupby('wormStrain')
# Determine the length of unique values in each
.nunique()
)
| worm.number | date | pathogenStrain | pathogenDose | doseLevel | timepoint | merontsPresent | sporesPresent | numEmbryos | experiment | experimenter | description | Infection Date | Plate Number | Total Worms | Spore Lot | Lot concentration | Total ul spore | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| wormStrain | ||||||||||||||||||
| AB1 | 60 | 1 | 1 | 2 | 2 | 1 | 1 | 1 | 16 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 2 |
| AWR144 | 50 | 5 | 2 | 2 | 2 | 1 | 2 | 2 | 22 | 6 | 1 | 3 | 5 | 3 | 1 | 2 | 2 | 2 |
| AWR145 | 50 | 5 | 2 | 2 | 2 | 1 | 2 | 2 | 14 | 6 | 1 | 3 | 5 | 3 | 1 | 2 | 2 | 2 |
| CB4856 | 64 | 1 | 1 | 2 | 2 | 1 | 1 | 2 | 27 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 2 |
| ED3042 | 56 | 1 | 1 | 2 | 2 | 1 | 2 | 1 | 10 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 2 |
| ED3052A | 50 | 3 | 1 | 2 | 2 | 1 | 2 | 1 | 14 | 4 | 1 | 2 | 3 | 2 | 1 | 1 | 1 | 2 |
| ED3052B | 50 | 3 | 1 | 2 | 2 | 1 | 2 | 1 | 13 | 4 | 1 | 2 | 3 | 2 | 1 | 1 | 1 | 2 |
| JU1400 | 115 | 11 | 9 | 10 | 6 | 2 | 2 | 2 | 25 | 41 | 1 | 7 | 9 | 18 | 3 | 3 | 5 | 12 |
| JU300 | 69 | 1 | 1 | 2 | 2 | 1 | 1 | 2 | 17 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 2 |
| JU360 | 65 | 1 | 2 | 4 | 2 | 1 | 1 | 2 | 23 | 4 | 1 | 1 | 1 | 4 | 1 | 2 | 2 | 4 |
| JU397 | 60 | 1 | 1 | 2 | 2 | 1 | 2 | 1 | 17 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 2 |
| JU642 | 61 | 1 | 1 | 2 | 2 | 1 | 1 | 1 | 19 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 2 |
| MY1 | 62 | 4 | 2 | 5 | 3 | 1 | 2 | 2 | 24 | 12 | 1 | 3 | 4 | 6 | 1 | 2 | 2 | 5 |
| MY2 | 71 | 1 | 2 | 4 | 2 | 1 | 1 | 2 | 34 | 4 | 1 | 1 | 1 | 4 | 1 | 2 | 2 | 4 |
| MY6 | 61 | 1 | 1 | 2 | 2 | 1 | 2 | 1 | 22 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 2 |
| N2 | 85 | 11 | 10 | 12 | 6 | 2 | 2 | 2 | 37 | 43 | 1 | 7 | 9 | 21 | 3 | 3 | 6 | 15 |
| VC20019 | 60 | 1 | 3 | 6 | 2 | 1 | 2 | 2 | 20 | 6 | 1 | 1 | 1 | 6 | 1 | 3 | 3 | 6 |
| VC40171 | 50 | 3 | 1 | 1 | 1 | 1 | 2 | 1 | 2 | 3 | 1 | 2 | 3 | 1 | 1 | 1 | 1 | 1 |
In 3 commands, we were able to determine the number of unique values not just for pathogen strains but across all of the data columns for each worm strain! What happens if we want to do more than just calculate a single value?
agg() function to generate a summary from multiple functions

Rather than using just a single function like nunique(), you can apply multiple functions to a grouped dataframe to generate a summary of the information. In this case we'll return to looking at the pathogenStrain column to simplify our output. We are also a little limited by the kind of summary we can achieve from string data. We can, however, still count() the number of entries in each group.
To combine both of these methods to produce a summary, we'll use the agg() method, which will accept a list of function names like "count" and "nunique" but also "sum", "min", "max" and other methods found in the GroupBy object. You can also use your own custom function just like the .apply() method.
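As a small sketch on toy data (strain names invented), agg() accepts a mix of built-in method names and custom functions in one list; the lambda column is auto-named by pandas:

```python
import pandas as pd

toy = pd.DataFrame({
    "wormStrain":     ["N2", "N2", "MY1", "MY1", "MY1"],
    "pathogenStrain": ["LUAm1", "ERTm5", "LUAm1", "LUAm1", "ERTm5"],
})

# Mix a custom lambda with string method names in a single agg() call
summary = (toy.groupby("wormStrain")["pathogenStrain"]
              .agg([lambda x: x.size, "count", "nunique"]))
print(summary)
```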
# Filter for only infected data
(embryo_merged_subset.loc[embryo_merged_subset['pathogenDose'] > 0, :]
# Group our data by worm strain
.groupby('wormStrain')
# Isolate pathogen strain information in each group
['pathogenStrain']
# Determine the total number of values in each group, the number of unique, and what they are
.agg([lambda x: x.size, "count", "nunique", "unique"])
)
| <lambda_0> | count | nunique | unique | |
|---|---|---|---|---|
| wormStrain | ||||
| AB1 | 120 | 120 | 1 | [LUAm1] |
| AWR144 | 300 | 300 | 2 | [LUAm1, ERTm5] |
| AWR145 | 300 | 300 | 2 | [LUAm1, ERTm5] |
| CB4856 | 125 | 125 | 1 | [ERTm5] |
| ED3042 | 107 | 107 | 1 | [LUAm1] |
| ED3052A | 183 | 183 | 1 | [LUAm1] |
| ED3052B | 200 | 200 | 1 | [LUAm1] |
| JU1400 | 2109 | 2109 | 9 | [LUAm1, ERTm5, AWRm78, LUAm3, MAM1, LUAm1-HK, ... |
| JU300 | 132 | 132 | 1 | [ERTm5] |
| JU360 | 251 | 251 | 2 | [LUAm1, ERTm2] |
| JU397 | 113 | 113 | 1 | [LUAm1] |
| JU642 | 121 | 121 | 1 | [LUAm1] |
| MY1 | 613 | 613 | 2 | [LUAm1, ERTm5] |
| MY2 | 250 | 250 | 2 | [ERTm5, ERTm2] |
| MY6 | 112 | 112 | 1 | [LUAm1] |
| N2 | 2221 | 2221 | 10 | [LUAm1, ERTm5, ERTm2, AWRm78, LUAm3, MAM1, LUA... |
| VC20019 | 323 | 323 | 3 | [LUAm1, ERTm5, ERTm2] |
| VC40171 | 150 | 150 | 1 | [LUAm1] |
Now that we've walked through a few analysis directions and introduced a number of great tools, you'll notice a basic pattern to most analyses:
- Subset or filter your data for the observations of interest
- Group your data with .groupby()
- Summarize each group with .agg() or with custom functions via apply()

After more practice and experience, some of these functions will become second nature to your coding choices.
# Filter for only infected data
(embryo_merged_subset.loc[embryo_merged_subset['pathogenDose'] > 0, :]
# Group our data by worm strain
.groupby(by = ['pathogenStrain', 'pathogenDose'])
# Isolate pathogen strain information in each group
['wormStrain']
# Determine the total number of values in each group, the number of unique, and what they are
.agg(...)
# Sort the values to simplify your search
.sort_values(...)
)
pyplot module

Now that we've had a chance to look at our data close up, let's talk about how we can use exploratory plots to give us a quick visual assessment of our data. We can use these visualizations to help make decisions about how to further analyse our data. Is there a difference between different groups of data? Does it look like there might be any bias between our datasets? What does the overall distribution of our sampling look like?
Often when trying to convey a message about our data through a visualization, we want to choose the right kind of visualization. These visualizations can also be referred to as figures or plots. Within the matplotlib package is the pyplot module. It is a collection of functions that give the matplotlib package capabilities that are very similar to the programming language MATLAB. The pyplot module has functions that can create some of the following basic plots:
| Plot type | Command | What to use it for |
|---|---|---|
| Bar plot | bar() | Population data summaries. Helpful for contrasting between groups |
| Scatter plot | scatter() | Multiple independent measurements across different variables |
| Line plot | plot() | Multiple measurements that represent the same sample(s) |
| Histogram | hist() | Generate a distribution by binning your data |
| Stem or lollipop | stem() | A twist on the bar plot that may be more compact and visually pleasing |
| Boxplot | boxplot() | Create a visual summary of your datapoints based on their distribution |
| Violin plot | violinplot() | Create a visual kernel density (distribution) estimate of datapoints |
Within each plot are a number of basic components: titles, axis properties, legends, etc. Here is a helpful table outlining some of the basic plot components.

| Component | Description | Command | Parameters |
|---|---|---|---|
| Title | The title of your plot | title() | |
| X- or Y-axis title | The axis titles of your plot | xlabel(), ylabel() | xlabel=str, loc={'left', 'center', 'right'}, text properties |
| X- or Y-axis ticks | Alter your axis tick positions/locations and labels | xticks(), yticks() | ticks=[a, ..., n], labels=[label1, ..., labeln] |
| Axis limits | A list defining the x- and y-axis limits | axis() | [xMin, xMax, yMin, yMax] |
| Axis scale | Set the kind of axis scale for your data | xscale(), yscale() | "linear", "log", "symlog", "logit" |
| Text properties | Labels can take text parameters too | | color, fontsize, fontstyle, rotation |
bar() method

We'll use our worm strain results as an example to try and plot some of our data as a bar plot. The bar() method generally requires two sets of data to be supplied along with some optional parameters:

- x: An array of x-coordinates (group labels, or x values)
- height: A float or array of bar heights - these are usually the measured/summarized values
- width: A float or array of bar widths (default is 0.8)
- bottom: The y-coordinates of the bases of the bars (default is 0)
- align: The alignment of the bars to your x-coordinate labels (default is "center")

Let's start by building a basic barplot and see what needs to be altered as we move forward. Note that we use the plt.show() function to display our plot after putting all of the pieces together.
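Before using our real data, here is a minimal, self-contained sketch of bar() with made-up heights, showing the width and align parameters (plt.show() is omitted here; in a notebook you would call it at the end):

```python
import matplotlib.pyplot as plt

# Made-up group labels and heights for illustration
strains = ["N2", "JU1400", "MY2"]
means = [18.9, 11.1, 21.9]

fig, ax = plt.subplots()
# width narrows/widens the bars; align="center" (the default)
# centres each bar on its x-coordinate label
bars = ax.bar(x=strains, height=means, width=0.6, align="center")
# In a notebook, finish with plt.show()
```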
# We'll reset the code cells to only show the last code call. This will de-clutter the plotting process for us
InteractiveShell.ast_node_interactivity = "last"
# Determine the mean number of embryos across all non-infected observations per strain
wormStrain_mean_embryos = (embryo_merged_subset.loc[embryo_merged_subset['pathogenDose'] == 0, :]
.groupby(['wormStrain'])['numEmbryos']
# Get the mean value in each group
.mean()
# Sort the data
.sort_values(ascending = False)
)
# Check on the results
wormStrain_mean_embryos
wormStrain
MY2        21.886179
CB4856     20.750000
AWR145     20.656000
AWR144     19.228000
N2         18.912500
MY6        18.360656
JU360      16.788618
JU300      15.622951
VC20019    15.143678
JU642      14.733333
ED3042     13.147541
ED3052B    12.047297
MY1        11.725581
JU397      11.457143
JU1400     11.139501
AB1        10.903226
ED3052A    10.526667
VC40171     4.586667
Name: numEmbryos, dtype: float64
# Build our barplot by giving the index as the x-label, and values as the height
plt.bar(x=wormStrain_mean_embryos.index,
height=wormStrain_mean_embryos)
# Show our plot
plt.show()
figure() to set your figure size

As we can see from our first attempt, the plot is rather small. There are a few problems with the basics of this plot, and we'll address the first one: the figure could be a little larger so we can see the x-axis labels better. Use the figure() method to set your figure size using the figsize parameter.
# Fix the size of the plot
plt.figure(figsize = (12,5))
# Build our barplot
plt.bar(x=wormStrain_mean_embryos.index,
height=wormStrain_mean_embryos)
# Show our plot
plt.show()
xticks() method

Now that our plot is larger, we can see the x-axis labels are still too crowded. We can, however, rotate the text and see if that helps. Let's rotate it to a 90-degree angle. We'll alter these axis properties through the xticks() method.
# Fix the size of the plot
plt.figure(figsize = (12, 5))
# Build our barplot
plt.bar(x=wormStrain_mean_embryos.index,
height=wormStrain_mean_embryos)
# Rotate the x-axis text
plt.xticks(rotation = 90)
# Show our plot
plt.show()
Okay, the plot is larger and we've fixed our x-axis label issues. No more overcrowding! Let's add a main title and axis titles to our plot. We can use the title(), xlabel(), and ylabel() methods in this case. While we set the labels, we can also set their properties such as fontsize, fontstyle, and color.
# Fix the size of the plot
plt.figure(figsize = (12, 5))
# Build our barplot
plt.bar(x=wormStrain_mean_embryos.index,
height=wormStrain_mean_embryos)
# Rotate the x-axis text
plt.xticks(rotation = 90)
# Add titles
plt.title("Baseline mean embryos per worm strain", fontsize = "x-large")
plt.xlabel("worm strain", fontstyle = "italic")
plt.ylabel("mean embryo count", color = "r")
# Show our plot
plt.show()
Similar to altering text properties, most of the plots allow you to alter various properties like fill and line colour, or other plot-specific attributes. Let's update the fill and line colours for our barplot using the color and edgecolor parameters.
# Fix the size of the plot
plt.figure(figsize = (12, 5))
# Build our barplot
plt.bar(x=wormStrain_mean_embryos.index,
height=wormStrain_mean_embryos,
color="orchid",
edgecolor="black")
# Rotate the x-axis text
plt.xticks(rotation = 90)
# Add titles
plt.title("Baseline mean embryos per worm strain", fontsize = "x-large")
plt.xlabel("worm strain", fontstyle = "italic")
plt.ylabel("mean embryo count", color = "r")
# Show our plot
plt.show()
?plt.axis
# Fix the size of the plot
plt.figure(figsize = (12, 5))
# Build our barplot
plt.bar(x=wormStrain_mean_embryos.index,
height=wormStrain_mean_embryos,
color="orchid",
edgecolor="black")
# Rotate the x-axis text
plt.xticks(rotation = 90)
# Add titles
plt.title("Baseline mean embryos per worm strain", fontsize = "x-large")
plt.xlabel("worm strain", fontstyle = "italic")
plt.ylabel("mean embryo count", color = "r")
# Alter the axis limits
plt.axis(...)
# Show our plot
plt.show()
seaborn package

Building upon our visualizations in the last section, there are some common themes you might recognize about them. We have a plot area, x- and y-axis data, axis limits, and plot colors. Using matplotlib to help generate your visualizations, you can control many small details, but it can also be tedious at times to micromanage so many aspects of your plot.
The seaborn package is actually built upon the pyplot module and tries to bring a high-level approach to statistical plots. As we'll see later on, this means updating certain details of our plots will require an understanding of the base matplotlib and pyplot functions.
seaborn package subdivides plot types into 3 categories

The seaborn package takes a dual-pronged approach to generating plots. There are functions considered to work at the Figure level and then there are functions that affect what is known as the Axes level. To simplify the concept:

- Axes: A single plot defined by an x- and y-axis grid. This includes all of the basic plots like scatter and box plots.
- Figure: A plot space that can contain from 1 to multiple Axes. The arrangement of Axes can range from simple to complex.

At the Axes level, there are 3 categories of plot types based on their similarity: relational, distribution, and categorical. For each of these categories there is a figure-level function that can be used to create multi-panel (faceted) versions of these plots by splitting the data further by categorical variables.
[Figure: From the seaborn overview - for most simple plots, one of the above figure- or axes-level plotting functions can be utilized.]
The above functions are used to initialize figure and axes objects by identifying a number of properties. Within these, some of the options can vary greatly based on plot type. Of the two levels of functions, their influence on figure attributes can vary:
- Figure-level functions return a FacetGrid object that has some additional methods for altering attributes of the plot in a way that makes sense to the subplot organization.
- Axes-level functions affect the Axes they are drawn onto but do not alter the figure in any other way. You can choose to draw onto the current axes in memory OR specify the reference to an axes which may be within a larger figure.

One approach to effective data visualization relies on the Grammar of Graphics framework originally proposed by Leland Wilkinson (2005). The idea of grammar can be summarized as follows:

The grammar of graphics is a language to communicate about what we are plotting programmatically.
It begins with a tidy data frame. It will have a series of observations (rows) each of which will be described across multiple variables (columns). Variables can actually represent qualitative or quantitative measurements or they could be descriptive data about the experiments or experimental groups.
The data units may undergo conversion through a process called scaling (transformation) before being used for plotting.
A subset of data columns are then passed on to be presented in various data plots (scatterplots, boxplots, kernel density estimates, etc.) by using the data to describe visual properties of the plot. We call these visual properties the aesthetics of the plot. For example, the data being plotted or represented can be visually altered in shape or colour based on accompanying column data.
A plot can have multiple layers (for example, a scatter plot with a regression line) and each of these plot types is referred to as a geom (short for geometric object).
The layered grammar of graphics and seaborn

The grammar of graphics facilitates the concise description of the components of any graphic. Hadley Wickham of R tidyverse fame has proposed a variant on this concept - the layered grammar of graphics framework in the ggplot2 package for R. By following a layered approach of defined components, it can be easy to build a visualization.
In a similar manner, the seaborn package has some methods that facilitate a layering approach to building your visualizations. However, many of the details are built upon the foundation of layering Axes objects or alterations upon Figure objects.
Each Axes-level function usually takes in:
Data: your visualization always starts here. What are the dimensions you want to visualize? What aspect of your data are you trying to convey?
Aesthetics: assign your axes based on the data dimensions you have chosen. Where will the majority of the data fall on your plot? Are there other dimensions (such as categorically encoded groupings) that can be conveyed by aspects like size, shape, colour, fill, etc.
Geometric objects: how will you display your data within your visualization? Which *plot function will you use?
The figure-level methods can be used to alter or update:
Scale: do you need to alter your x- or y-axis limits? What about scaling/transforming any values to fit your data within a range? Sometimes, depending on the geometric object, you are better off transforming your data ahead of time.

Facets: will generating subplots of the data add a dimension to your visualization that would otherwise be lost?

Coordinate system: will your visualization follow a classic Cartesian, semi-log, polar, etc. coordinate system?
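Before we do, the two levels can be sketched side by side on a throwaway DataFrame (the column names here are invented for illustration; this assumes seaborn is importable as sns):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Throwaway data for illustration only
toy = pd.DataFrame({"x": [1, 2, 3, 4],
                    "y": [2, 4, 6, 8],
                    "grp": ["a", "a", "b", "b"]})

# Axes-level: draws onto a single Axes we supply (or the current one)
fig, ax = plt.subplots()
sns.scatterplot(data=toy, x="x", y="y", ax=ax)

# Figure-level: builds its own figure and returns a FacetGrid,
# here with one Axes per category of "grp"
g = sns.relplot(data=toy, x="x", y="y", col="grp")
```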
Let's jump into our first dataset and start building some plots with it shall we?
infection_signal.tsv

Before we dig into the seaborn package, we will import a new dataset that has a slightly more diverse array of data that we can use for showcasing the plotting power of seaborn.
# We'll reset the code cells to only show FINAL code output.
InteractiveShell.ast_node_interactivity = "last"
# Read the infection signal data in from file
infectionSig_data = pd.read_csv("data/infection_signal.tsv", sep = "\t")
# Look at the first 5 rows of data
infectionSig_data.head(5)
| exp.name | strain | spore.strain | spore.species | dose | spores | fixing.date | slide | file | worm.number | area | percent.infected | area.infected | timepoint | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | N2-LUAm1-1.8 | N2 | LUAm1 | N.ferruginous | pulse-72H | 1.8 | rep1 | 1 | N2.LUAm1.rep1 | 1 | 49838.02 | 18.53 | 9234.985106 | 72hpi |
| 1 | N2-LUAm1-1.8 | N2 | LUAm1 | N.ferruginous | pulse-72H | 1.8 | rep1 | 1 | N2.LUAm1.rep1 | 2 | 50425.04 | 0.00 | 0.000000 | 72hpi |
| 2 | N2-LUAm1-1.8 | N2 | LUAm1 | N.ferruginous | pulse-72H | 1.8 | rep1 | 1 | N2.LUAm1.rep1 | 3 | 45532.67 | 31.16 | 14187.979970 | 72hpi |
| 3 | N2-LUAm1-1.8 | N2 | LUAm1 | N.ferruginous | pulse-72H | 1.8 | rep1 | 1 | N2.LUAm1.rep1 | 4 | 46458.55 | 3.88 | 1802.591740 | 72hpi |
| 4 | N2-LUAm1-1.8 | N2 | LUAm1 | N.ferruginous | pulse-72H | 1.8 | rep1 | 1 | N2.LUAm1.rep1 | 5 | 49214.73 | 0.00 | 0.000000 | 72hpi |
# How big is this dataframe?
infectionSig_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 456 entries, 0 to 455
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   exp.name          456 non-null    object
 1   strain            456 non-null    object
 2   spore.strain      456 non-null    object
 3   spore.species     456 non-null    object
 4   dose              456 non-null    object
 5   spores            456 non-null    float64
 6   fixing.date       456 non-null    object
 7   slide             456 non-null    int64
 8   file              456 non-null    object
 9   worm.number       456 non-null    int64
 10  area              456 non-null    float64
 11  percent.infected  456 non-null    float64
 12  area.infected     456 non-null    float64
 13  timepoint         456 non-null    object
dtypes: float64(4), int64(2), object(8)
memory usage: 50.0+ KB
# What are the unique values in each column?
infectionSig_data.apply(pd.unique, axis = 0)
exp.name            [N2-LUAm1-1.8, JU1400-LUAm1-1.8, AWR144-LUAm1-...
strain                                  [N2, JU1400, AWR144, AWR145]
spore.strain                                                 [LUAm1]
spore.species                                        [N.ferruginous]
dose                                                     [pulse-72H]
spores                                                         [1.8]
fixing.date                                       [rep1, rep2, rep3]
slide                                                            [1]
file                [N2.LUAm1.rep1, JU1400.LUAm1.rep1, AWR144.LUAm...
worm.number         [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...
area                [49838.02, 50425.04, 45532.67, 46458.55, 49214...
percent.infected    [18.53, 0.0, 31.16, 3.88, 23.54, 1.65, 8.6, 18...
area.infected       [9234.985106, 0.0, 14187.97997, 1802.59174, 93...
timepoint                                                    [72hpi]
dtype: object
Taking a quick look at our data we can summarize it briefly here and see that there are a few categories we can explore across variables like strain and fixing.date for measured variables like area (total area of a worm), area.infected (infection signal area) and percent.infected (infected area as a % of the total).
relplot() method

We'll start by building a basic scatterplot, focusing on comparing the total worm area versus the infected area. Rather than working at the axes level, we'll work with the encompassing relplot() method, which will give us flexibility in our visual exploration down the road.
For the basic plot relplot(), we'll start with the following parameters:
- data: The tidy (long-form) data set we want to visualize.
- x, y: The variable names for assigning x- and y-axis values.
- height: The total height of our figure.
- aspect: A scalar value used to determine the width of your figure (width = height * aspect).
- kind: The type of plot we want to produce: scatter (default) vs line.
- kwargs: A catch-all for any other keyword arguments that will be passed on to underlying functions like the Axes-level methods.

# Import your seaborn package
import seaborn as sns
# We'll build a scatterplot
snsPlot = sns.relplot(data = infectionSig_data, # Set the data
height = 6, aspect = 1, # Set the size of the figure
kind = "scatter" # Set the figure type
)
# Show the plot
plt.show()
Looking at the output we can see that without having specified the x and y axis values, it simply plotted all of the variables along a default x-axis of index number (456 rows total). Let's try again by setting our axis variables.
# We'll build a scatterplot
snsPlot = sns.relplot(data = infectionSig_data,
x = "area", y = "area.infected", # Set the axis variables
height = 6, aspect = 1,
kind = "scatter"
)
# Show the plot
plt.show()
hue parameter

Now we begin to see the power of having a tidy DataFrame. Since each of our observations is in its own row, we can classify each observation by factors like strain! Using the seaborn package, we can specify the colour of our points using the hue parameter. Since we have set the data = infectionSig_data parameter, we can tell seaborn to look at a specific column when determining the hue parameter.
At the same time, we'll set the alpha parameter which is essentially the opacity of each datapoint. Setting a lower value increases transparency which allows us to see overlapping datapoints better. This parameter becomes especially helpful when working with extremely dense datapoints.
# We'll build a scatterplot
snsPlot = sns.relplot(data = infectionSig_data,
x = "area", y = "area.infected",
hue = "strain", # Set the point-colour by strain
alpha = 0.6, # Set the transparency of the points
height = 6, aspect = 1,
kind = "scatter"
)
# Show the plot
plt.show()
set() method

The set() method is a gateway to altering a number of aspects of your plot. Once we have our plot saved as an object named snsPlot, we can alter or set some of its properties this way. In particular, we will be using the yscale parameter to change the y-axis to a log scale.

Note that our plot object is actually a seaborn FacetGrid (the figure-level object mentioned earlier), and that's where we are calling the set() method from. We can set quite a few figure attributes through this method.
# We'll build a scatterplot
snsPlot = sns.relplot(data = infectionSig_data,
x = "area", y = "area.infected",
hue = "strain", # Set the point-colour by strain
alpha = 0.6, # Set the transparency of the points
height = 6, aspect = 1,
kind = "scatter"
)
# Change your y-axis to a log scale
snsPlot.set(yscale="log")
# Show the plot
plt.show()
relplot() method

If we want to split our data into multiple new plots based on certain variables, this is known as faceting your data. Usually this results in a grid-like pattern where data is grouped by the categories of one variable as columns, and another variable as rows, although the data could simply be split on a single variable instead. Either way, this generates a figure-level object known as a FacetGrid.
The relplot() method already has the capability to handle this splitting of axes within the figure it generates. The relplot() method can facet data across two variables using the row and col parameters. We simply need to name the variable(s) that will be used to categorize the data.
To summarize, the parameters to use for this operation are:

- col: The variable name that will be used to group the columns of your grid.
- row: The variable name that will be used to group the rows of your grid.

Below, we'll remove the colouring of points based on strain and instead split the data into two Axes based on this information.
# We'll build a scatterplot
snsPlot = sns.relplot(data = infectionSig_data,
x = "area", y = "area.infected",
alpha = 0.6, # Set the transparency of the points
height = 6, aspect = 0.6, # Set the size of the figure
kind = "scatter",
col = "strain" # Split the columns of the grid by strain
)
# Change your y-axis to a log scale
snsPlot.set(yscale="log")
# Show the plot
plt.show()
Note that we typically need to apply just one attribute to each dimension of data we are investigating. By splitting the data by strain we no longer need to colour it by this category. We can, however, add additional information to our visualization by using another dimension in our data. Instead of colouring the points based on a categorical variable, we can use a continuous variable like percent.infected from our dataset to see if there could be a trend in relation to our data.
It is easy enough to set this dimension using the hue parameter in our initial relplot() call. We'll also set the palette parameter to a different colour map, and take a second to set the edgecolor parameter so that our lower-value/white points can still be seen.
# We'll build a scatterplot
snsPlot = sns.relplot(data = infectionSig_data,
x = "area", y = "area.infected",
hue = "percent.infected", palette = "Reds", # Set the point-colour by percent infected
edgecolor = "black", # Set the point border colour so we can see them all
alpha = 0.6,
height = 6, aspect = 0.6,
kind = "scatter",
col = "strain" # Split the columns of the grid by strain
)
# Change your y-axis to a log scale
snsPlot.set(yscale="log")
# Show the plot
plt.show()
By colouring our datapoints by the percentage infected, we can now see this extra information directly on the same plot. Rather than generating additional plots comparing different pairs of variables, we've simply added another dimension of information to our visualization.
set_axis_labels() method¶The names of our axis titles are drawn from the variable names we used for the original DataFrame but we may be limited in how those variables are originally named. In other cases you may wish to add units, or simply make your axis titles more descriptive. To accomplish this we can alter our labels directly using the set_axis_labels() method. The parameters to set are x_var and y_var in that order. Set them directly if you only want to change a single axis title.
# We'll build a scatterplot
snsPlot = sns.relplot(data = infectionSig_data,
x = "area", y = "area.infected",
hue = "percent.infected", palette = "Reds", # Set the point-colour by percent infected
edgecolor = "black", # Set the point border colour so we can see them all
alpha = 0.6,
height = 6, aspect = 0.6,
kind = "scatter",
col = "strain" # Split the columns of the grid by strain
)
# Set the axis titles (Aesthetics)
snsPlot.set_axis_labels(x_var = "Total area", y_var = "Area infected")
# Change your y-axis to a log scale
snsPlot.set(yscale="log")
# Show the plot
plt.show()
style parameter¶We'll switch gears a little at this point and distinguish our observations by another categorical variable, fixing.date. As an aside: when a variable has many categories, presenting every facet in a single row becomes very hard to read, so the col_wrap parameter can be used to set the number of facets per row. Note that col_wrap is not compatible with faceting by both a column and a row variable.
Since colour is already being used for percent.infected, we'll assign a point shape to each fixing.date value using the style parameter instead.
# We'll build a scatterplot
snsPlot = sns.relplot(data = infectionSig_data,
x = "area", y = "area.infected",
hue = "percent.infected", palette = "Reds", # Set the point-colour by percent infected
edgecolor = "black", # Set the point border colour so we can see them all
style = "fixing.date",
alpha = 0.6,
height = 6, aspect = 0.5,
kind = "scatter",
col = "strain" # Split the columns of the grid by strain
)
# Set the axis titles (Aesthetics)
snsPlot.set_axis_labels(x_var = "Total area", y_var = "Area infected")
# Change your y-axis to a log scale
snsPlot.set(yscale="log")
# Show the plot
plt.show()
Now that we have some of the basics, it's time to take a closer look at using other types of plots. Let's return to our embryo data in embryo_merged_subset. It has a lot of nice population-based data that we can dissect to look at theoretical distributions.
&¶We'll focus our dataset first by filtering for just the N2 strain in the mock infection condition. To accomplish this we'll use the conditional AND & operator, which can combine our boolean expressions. To accompany this, you can also use the conditional OR |, and we've already seen the logical NOT ~, which converts a boolean to its opposite.
Along with this, each boolean expression must be separately enclosed in parentheses ( ). We'll talk more about this next time in Lecture 05. For now, we'll save the filtered data into the object N2_mock_data.
# Change the value of ast_node_interactivity
# InteractiveShell.ast_node_interactivity = "all"
# Filter the data by N2 animals with a pathogenDose of 0
N2_mock_data = embryo_merged_subset.loc[(embryo_merged_subset['wormStrain'] == "N2") &
(embryo_merged_subset['pathogenDose'] == 0),
:]
# Check on the data created
N2_mock_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 720 entries, 503 to 10748
Data columns (total 19 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   worm.number        720 non-null    int64
 1   date               720 non-null    int64
 2   wormStrain         720 non-null    object
 3   pathogenStrain     720 non-null    object
 4   pathogenDose       720 non-null    float64
 5   doseLevel          720 non-null    object
 6   timepoint          720 non-null    object
 7   merontsPresent     720 non-null    bool
 8   sporesPresent      720 non-null    bool
 9   numEmbryos         720 non-null    int64
 10  experiment         720 non-null    object
 11  experimenter       720 non-null    object
 12  description        720 non-null    object
 13  Infection Date     720 non-null    int64
 14  Plate Number       720 non-null    int64
 15  Total Worms        720 non-null    int64
 16  Spore Lot          720 non-null    object
 17  Lot concentration  720 non-null    int64
 18  Total ul spore     720 non-null    float64
dtypes: bool(2), float64(2), int64(7), object(8)
memory usage: 102.7+ KB
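The three boolean operators can be compared side by side on a toy DataFrame (hypothetical values, standing in for embryo_merged_subset):

```python
import pandas as pd

# A toy DataFrame standing in for embryo_merged_subset (hypothetical values)
df = pd.DataFrame({"wormStrain": ["N2", "N2", "AB1", "CB4856"],
                   "pathogenDose": [0, 100, 0, 0]})

# AND: both expressions must be True; each one wrapped in parentheses
and_rows = df.loc[(df["wormStrain"] == "N2") & (df["pathogenDose"] == 0)]

# OR: either expression may be True
or_rows = df.loc[(df["wormStrain"] == "N2") | (df["pathogenDose"] == 0)]

# NOT: flip the boolean Series
not_n2 = df.loc[~(df["wormStrain"] == "N2")]

print(len(and_rows), len(or_rows), len(not_n2))  # 1 4 2
```

Note that forgetting the parentheses (e.g. df["wormStrain"] == "N2" & df["pathogenDose"] == 0) changes the order of operations and typically raises an error.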
kdeplot() or displot()¶There are a lot of datapoints in our new dataset. This time our measurements are based on numEmbryos for individual uninfected worms across different replicate experiments. A question we already investigated was the overall distribution of our embryo numbers across replicates for all of our strains. We can, however, also visualize this data as a distribution directly.
A quick way to do this is by generating a kernel density estimate (KDE) using the displot() method from seaborn. We'll need to provide an x value (numEmbryos in this case) and we'll colour our plots based on the date variable, which is coded as a series of integers representing our separate experimental replicates.
# We'll build a KDE plot
snsPlot = sns.displot(data = N2_mock_data,
x = "numEmbryos",
hue = "date",
height = 6, aspect = 2,
kind = "kde",
fill = True, # The fill parameter is passed on to kdeplot()
)
# Show the plot
plt.show()
Why do we only have 2 colours in our distribution plot? We can see all of the various dates, BUT only the first is really coloured and the others all share a darker hue. Because date is stored as an integer, seaborn has mapped it onto a continuous colour gradient, much like when it handled the hue parameter for our scatterplot. So, should the values in date be treated as numbers or as separate groups?
category data type assigns order or meaning to your values¶Sometimes when we work with our data, we may produce what look like numerical values for a variable, like a replicate number or serial number. However, these values aren't just numbers but really grouping values, just like we have in our wormStrain variable. Python/pandas doesn't differentiate on this inherently when passing data around because it has no insight into our intentions. Moreover, when we work with the seaborn package, it will also treat integers and floats as plain numbers.
Recall in section 3.3.1 we looked at the dtype values in our DataFrame and the date variable was an int64. What we want is some other way to represent this data. We could convert it to a string str BUT we'll introduce a better dtype called the category.
Categorical variables are very handy when working with statistical analysis and help to define groups but can also give them an order of importance. This means when analysing or plotting the data, this specific order can be used to determine how that data is used or displayed.
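As a small pandas sketch (toy replicate labels, not our data), converting integers to a category turns each value into a group label, and the order of those groups can be set explicitly:

```python
import pandas as pd

# Hypothetical replicate labels stored as plain integers
s = pd.Series([3, 1, 2, 1])
print(s.dtype)  # int64 - pandas sees numbers, not groups

# As a category, each distinct value becomes a group label
cat = s.astype("category")
print(list(cat.cat.categories))  # [1, 2, 3]

# The category order can be redefined, e.g. reversed
cat = cat.cat.reorder_categories([3, 2, 1], ordered = True)
print(list(cat.cat.categories))  # [3, 2, 1]
```

When plotting, seaborn respects this category order, which is exactly what we'll exploit below.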
To start off simple, let's convert our date variable to a category data type and see how that affects our KDE plot.
# Convert our date variable with the astype() method
N2_mock_data = N2_mock_data.astype({'date':'category'})
# rebuild a KDE plot
snsPlot = sns.displot(data = N2_mock_data,
x = "numEmbryos",
hue = "date",
height = 6, aspect = 2,
kind = "kde",
fill = True, # The fill parameter is passed on to kdeplot()
)
# Show the plot
plt.show()
It worked! Now we can see that each date has been given a distinct colour!
So right now we can see the order of our data has been set by the date values with our earliest date (190426) on top and our latest date (200918) on the bottom. When defining categorical data, you can also define the order of your data. To help us out, we'll grab an array of the unique elements in our date variable and then we can use the slicing notation [::-1] to reverse our array.
# The date variable was already converted to a category above:
# N2_mock_data = N2_mock_data.astype({'date':'category'})
# Take the unique values and flip the order
date_list = N2_mock_data.date.unique()[::-1].tolist()
# View the reversed date list
date_list
[200916, 200905, 200918, 200915, 200904, 200825, 200821, 200721, 200714, 200707, 190426]
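The [::-1] slice used above works on any Python sequence, returning a reversed copy:

```python
# The [::-1] slice steps backwards through a sequence, returning a reversed copy
nums = [1, 2, 3, 4]
print(nums[::-1])  # [4, 3, 2, 1]

# The original sequence is untouched
print(nums)  # [1, 2, 3, 4]
```

The same notation applies to the NumPy array returned by .unique(), which is why we could reverse it before calling .tolist().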
reorder_categories()¶Now that we have a reversed list of our categories (or you could make a custom list, of course), you can replace the date column in our dataset. To alter the categorical information, you must access the .cat property and use the .reorder_categories() method on it. This returns a new Categorical object, which you must use to replace the original date data.
We'll take a look at our updated KDE plot afterwards and see if it worked!
# We'll need to pull out and replace the date column
N2_mock_data['date'] = (N2_mock_data['date']
# access the categorical property
.cat
# Reorder the categorical data
.reorder_categories(new_categories = date_list)
)
# Double-check the date column results
N2_mock_data['date']
503 190426
504 190426
505 190426
506 190426
507 190426
...
10744 200916
10745 200916
10746 200916
10747 200916
10748 200916
Name: date, Length: 720, dtype: category
Categories (11, int64): [200916, 200905, 200918, 200915, ..., 200721, 200714, 200707, 190426]
# Rebuild a KDE plot with the new category order
snsPlot = sns.displot(data = N2_mock_data,
x = "numEmbryos",
hue = "date", # Now the date variable is a category!
height = 6, aspect = 2,
kind = "kde",
fill = True, # The fill parameter is passed on to kdeplot()
)
# Show the plot
plt.show()
rugplot() to the margin¶Within seaborn there are a few ways to add marginal plots to your visualizations. Marginal plots usually add distribution summaries like a histogram, a KDE or, in our case, a rugplot. More specifically, we'll be using the rugplot() method, but you can also create certain plot combinations using the jointplot() method.
A rugplot is simply a series of vertical or horizontal tick-marks representing our actual data points along the x- and/or y-axis. For our rugplot, we'll add it outside of the plot area by manipulating the height and clip_on parameters. We'll also set the alpha parameter to help us see the density of our tick marks a little better. Regardless of where it is drawn, we are adding this plot onto the underlying Axes object of the current snsPlot figure.
# We'll build a KDE plot
snsPlot = sns.displot(data = N2_mock_data,
x = "numEmbryos",
hue = "date", # Now the date variable is a category!
height = 6, aspect = 2,
kind = "kde",
fill = True, # The fill parameter is passed on to kdeplot()
)
# Add a rugplot - this is plotted on top of the current axes object
sns.rugplot(data = N2_mock_data,
x = "numEmbryos",
height = -0.02, clip_on = False,
alpha = 0.5
)
# Show the plot
plt.show()
Boxplots are a great way to visualize summary statistics for your data. As a reminder, the thick line in the center of the box is the median. The upper and lower ends of the box are the first and third quartiles (or 25th and 75th percentiles) of your data. The whiskers extend to the largest value no further than 1.5*IQR (inter-quartile range - the distance between the first and third quartiles).
Data beyond these whiskers are considered outliers and plotted as individual points. This is a quick way to see how comparable your samples or variables are.
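These summary statistics are easy to verify by hand with pandas. A sketch on hypothetical embryo counts (toy values, not our dataset):

```python
import pandas as pd

# Hypothetical embryo counts for one strain
counts = pd.Series([2, 5, 7, 8, 9, 10, 11, 12, 14, 40])

median = counts.median()       # the thick centre line of the box
q1 = counts.quantile(0.25)     # lower edge of the box
q3 = counts.quantile(0.75)     # upper edge of the box
iqr = q3 - q1                  # inter-quartile range: the height of the box

# Whiskers reach the most extreme points within 1.5 * IQR of the box edges
upper_fence = q3 + 1.5 * iqr
lower_fence = q1 - 1.5 * iqr

# Anything beyond the fences is drawn as an individual outlier point
outliers = counts[(counts > upper_fence) | (counts < lower_fence)]
print(float(median), float(iqr))  # 9.5 4.5
print(list(outliers))             # [40]
```

Seaborn computes exactly these quantities for each box it draws, so this is a handy sanity check when a boxplot looks surprising.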
We are going to use boxplots to see the distribution of embryos per worm strain across all uninfected worm samples.
catplot() method¶To build the basic boxplot we begin with the main variables. We want to summarize the embryo counts (y-axis) for each worm strain (x-axis) to get a sense of the distribution of numEmbryos per strain. We can use the catplot() method to build our visualization. You'll note that the plot is automatically coloured to differentiate between the x-axis groups.
The catplot() method is the gateway to more categorical plots and behaves similarly to the other two figure-level plots we've encountered.
Before we begin, we'll summarize the data again as we did earlier in section 1.3.0, but this time we'll also include an extra grouping level, doseLevel.
As well, we'll use the .isin() method to help us filter. This method makes an element-wise determination of whether each element is within a supplied list of values. We'll also use the .agg() method for flavour.
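As a standalone sketch of that element-wise test (toy values, standing in for the doseLevel column):

```python
import pandas as pd

s = pd.Series(["Mock", "High", "Medium", "Low"])

# isin() checks each element against the supplied list, returning a boolean mask
mask = s.isin(["Mock", "Medium"])
print(list(mask))  # [True, False, True, False]

# The mask can then be used to keep only the matching rows
print(list(s[mask]))  # ['Mock', 'Medium']
```

This is exactly the mask that .loc consumes in the pipeline below.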
# Filter for only uninfected data
mean_Embryo_data = (embryo_merged_subset.loc[embryo_merged_subset['doseLevel']
.isin(values = ["Mock", "Medium"])] # filter by doseLevel
# Group by infection experiment
.groupby(by = ['date', 'wormStrain', 'pathogenStrain', 'doseLevel'])
# Create the frequency table on numEmbryos
['numEmbryos']
.agg('mean')
# Recall, when we reset the index, it converts the indices back into columns
.reset_index()
.astype({'doseLevel':'category'})
)
# Check out our summarized data
mean_Embryo_data
| | date | wormStrain | pathogenStrain | doseLevel | numEmbryos |
|---|---|---|---|---|---|
| 0 | 190426 | AB1 | LUAm1 | Medium | 8.583333 |
| 1 | 190426 | AB1 | LUAm1 | Mock | 10.903226 |
| 2 | 190426 | CB4856 | ERTm5 | Medium | 17.484375 |
| 3 | 190426 | CB4856 | ERTm5 | Mock | 20.750000 |
| 4 | 190426 | ED3042 | LUAm1 | Medium | 2.352941 |
| ... | ... | ... | ... | ... | ... |
| 138 | 200916 | N2 | ERTm5-96H | Mock | 20.140000 |
| 139 | 200918 | AWR144 | ERTm5 | Mock | 20.060000 |
| 140 | 200918 | AWR145 | ERTm5 | Mock | 21.760000 |
| 141 | 200918 | JU1400 | ERTm5 | Mock | 14.140000 |
| 142 | 200918 | N2 | ERTm5 | Mock | 21.360000 |
143 rows × 5 columns
Now we can build our boxplot using the summarized data!
# Use catplot to make our boxplot
snsPlot = sns.catplot(data = mean_Embryo_data,
x = "wormStrain", y = "numEmbryos",
kind = "box", # Make a boxplot
height = 6, aspect = 2 # Height of 6 and width of 12 (height * aspect)
)
# Show the plot
plt.show()
set_xticklabels() method¶We've encountered this problem before and the solution is actually the same. Recall that seaborn is built upon the back of the pyplot module, so we can modify the plot directly!
We'll rotate the x-axis text to 90 degrees with the set_xticklabels() method on our plot object.
# Use catplot to make our boxplot
snsPlot = sns.catplot(data = mean_Embryo_data,
x = "wormStrain", y = "numEmbryos",
kind = "box", # Make a boxplot
height = 6, aspect = 2
)
# Alter the xtick attributes
snsPlot.set_xticklabels(rotation = 90)
# Show the plot
plt.show()
hue parameter¶We already know that our data is measured across two dose levels, so we can take advantage of this information to create a nested (paired/grouped) set of boxplots. When you have a smaller number of categories, this allows you to more directly compare the characteristics of your two populations. Let's see what happens when we use the hue parameter to distinguish between our dose-level data.
We'll also set the boxplot() parameter width to put a little more distance between each category along the x-axis. This value usually ranges between 0 and 1.
To save on some space we'll additionally move the legend into the plot using the legend_out boolean parameter.
# Use catplot to make our boxplot
snsPlot = sns.catplot(data = mean_Embryo_data,
x = "wormStrain", y = "numEmbryos",
kind = "box", # Make a boxplot
height = 6, aspect = 2,
hue = "doseLevel",
legend_out = False, # Move the legend inside the plot
width = 0.6 # Put some more distance between categories by decreasing width
)
# Alter the xtick attributes
snsPlot.set_xticklabels(rotation = 90)
# Show the plot
plt.show()
Let's take a moment to fix our doseLevel category so that we plot the "Mock" data before the infected data in our boxplot. Remember, we can use .reorder_categories() to accomplish this.
# Pull out our doseLevel data and replace it with a newly categorized version
mean_Embryo_data['doseLevel'] = (mean_Embryo_data['doseLevel']
.cat
.reorder_categories(new_categories = ["Mock", "Medium"])
)
# Double check it worked
mean_Embryo_data['doseLevel'].cat.categories
Index(['Mock', 'Medium'], dtype='object')
# Use catplot to make our boxplot
snsPlot = sns.catplot(data = mean_Embryo_data,
x = "wormStrain", y = "numEmbryos",
kind = "box", # Make a boxplot
height = 6, aspect = 2,
hue = "doseLevel",
legend_out = False, # Move the legend inside the plot
width = 0.6 # Put some more distance between categories by decreasing width
)
# Alter the xtick attributes
snsPlot.set_xticklabels(rotation = 90)
# Show the plot
plt.show()
row and col parameters¶So our plot above is quite busy. Whereas previously, we used the relplot() method to generate a faceted scatterplot (or relational plot), here we'll use the catplot() method to accomplish something similar. The catplot() method handles the distribution or faceting of categorical plots using similar parameters:
- col: The variable name that will be used to group the columns of your grid.
- row: The variable name that will be used to group the rows of your grid.

This time around we'll facet our data across our worm strains using the col parameter (wrapping the grid with col_wrap). You'll notice that since we are no longer grouping by hue, the legend will also disappear.
At the same time, we'll play with a few additional methods:
- set_axis_labels(): can accept an x- and y-axis label in the form of a string. You can also set separate labels with the set_xlabels and set_ylabels methods.
- set_xticklabels(): we already used this to rotate our x-axis text, but we can also set the labels parameter.
- set_titles(): we'll use this to simplify the name of each panel to just the value from our variable. This is given as the string "{col_name}". "{col_var}" would instead use the name of the variable we're faceting by, i.e. "wormStrain".

sns.set(font_scale = 4)
# Use catplot to make our boxplot
snsPlot = sns.catplot(data = mean_Embryo_data,
x = "doseLevel", y = "numEmbryos",
kind = "box", # Make a boxplot
height = 10, aspect = 1,
width = 0.6, # Put some more distance between categories by decreasing width
col = "wormStrain",
col_wrap = 5
)
# Alter the xtick attributes
snsPlot.set_xticklabels(labels = ["Mock", "Medium"])
# Change our y-axis label
snsPlot.set_axis_labels("dose level", "embryos per animal")
# set the title
snsPlot.set_titles("{col_name}")
# Show the plot
plt.show()
map_dataframe()¶Even though boxplots give us summary statistics on our data, it is useful to readers (and reviewers) to be able to see where our individual data points are. We've already used rugplot() to help visualize our data distribution in density plots. In that case, we simply plotted on top of the already present simple plot.
Similarly, for a boxplot we can add the data as another layer using sns.swarmplot() to place dots on top of our boxplot. A swarmplot places overlapping data points next to each other, so we get a better picture of the distribution of our data.
In the case of our faceted boxplot, however, we cannot simply overlay with sns.swarmplot(). Instead, we need to map our data using the .map_dataframe() method. It preserves the underlying panel/graph and uses its characteristics to overlay a new plot, potentially with new data. It uses the following parameters:

- func: the function we want to overlay. This would be sns.swarmplot in our case. Note the lack of parentheses!
- args and kwargs: we'll talk more about this next week, but essentially any additional arguments you would normally use for your plotting function are simply supplied here as named parameters.

# Use catplot to make our boxplot
snsPlot = sns.catplot(data = mean_Embryo_data,
x = "doseLevel", y = "numEmbryos",
kind = "box", # Make a boxplot
height = 10, aspect = 1,
width = 0.6, # Put some more distance between categories by decreasing width
col = "wormStrain",
col_wrap = 5,
fliersize = 0 # Hide our outliers by making them size 0
)
# Alter the xtick attributes
snsPlot.set_xticklabels(labels = ["Mock", "Medium"])
# set the title
snsPlot.set_titles("{col_name}")
# Overlay a swarmplot on our catplot
snsPlot.map_dataframe(sns.swarmplot, data = mean_Embryo_data,
x = "doseLevel",
y = "numEmbryos",
hue = "doseLevel", # split the points by dose level just like nested boxplots
palette="dark:black", # Recolour all of the points to black
size=15 # Set the size of our points so they can be seen
)
# Change our y-axis label. It appears to work best AFTER the map_dataframe() call
snsPlot.set_axis_labels("dose level", "embryos per animal")
# Show the plot
plt.show()
If you could combine aspects of the boxplot and the KDE into a single visualization, you would get the violin plot. Another way to think of the violin plot is as a KDE plot that's been shrunk down and placed categorically.
It's actually quite easy to switch over since many of the aspects are similar to the boxplot. We need only change the kind parameter in our catplot() code.
# Use catplot to make our violin plot
snsPlot = sns.catplot(data = mean_Embryo_data,
x = "doseLevel", y = "numEmbryos",
kind = "violin", # Make a violin plot
height = 10, aspect = 1,
width = 0.6, # Put some more distance between categories by decreasing width
col = "wormStrain",
col_wrap = 5 # Note: fliersize applies to boxplots, not violin plots
)
# Alter the xtick attributes
snsPlot.set_xticklabels(labels = ["Mock", "Medium"])
# set the title
snsPlot.set_titles("{col_name}")
# Overlay a swarmplot on our catplot
snsPlot.map_dataframe(sns.swarmplot, data = mean_Embryo_data,
x = "doseLevel",
y = "numEmbryos",
hue = "doseLevel", # split the points by dose level just like nested boxplots
palette="dark:black", # Recolour all of the points to black
size=15 # Set the size of our points so they can be seen
)
# Change our y-axis label. It appears to work best AFTER the map_dataframe() call
snsPlot.set_axis_labels("dose level", "embryos per animal")
# Show the plot
plt.show()
split parameter¶Sometimes a more direct comparison of your data can be applied through the violin plot by generating a split version of it. This is especially helpful when you are working with nested data that is binary and you would like to compare it visually.
We'll initialize this visualization with the split boolean parameter. To help with this visualization we'll also:
- switch the swarmplot to a stripplot to accommodate the narrower width of the half-violins
- set the inner markers to show the quartile information in each violin
- use the palette parameter, which accepts a dictionary-like object too!

sns.set(font_scale = 1)
# Use catplot to make our violin plot
snsPlot = sns.catplot(data = mean_Embryo_data,
x = "wormStrain", y = "numEmbryos",
kind = "violin", # Make a violin plot
height = 6, aspect = 2,
hue = "doseLevel",
split = True, # This will create hybrid violin plots
inner = "quartile", # Add quartile markers to each half of the violin
palette = {"Mock":"green", "Medium": "yellow"}
)
# Overlay a swarmplot on our catplot
snsPlot.map_dataframe(sns.stripplot, data = mean_Embryo_data,
x = "wormStrain",
y = "numEmbryos",
hue = "doseLevel", # split the points by dose level just like nested boxplots
palette="dark:black", # Recolour all of the points to black
size=5, # Set the size of our points so they can be seen
dodge = True
)
# Change our y-axis label. It appears to work best AFTER the map_dataframe() call
snsPlot.set_axis_labels("dose level", "embryos per animal")
# Alter the xtick attributes
snsPlot.set_xticklabels(rotation = 90)
# Show the plot
plt.show()
Up until now, we have taken for granted that our plots have been displayed through a graphics device. In our Jupyter Notebooks we can see the graphs right away and update our code. You can even save them manually from the output display, but sometimes you may be producing multiple visualizations based on large datasets. In that case it is preferable to save them directly to file.
plt.savefig() method¶Once you have a figure the way you want it, you can save it in any number of file formats. The savefig() method from the pyplot module is here to save the day. To save the current figure, you can use some of the following parameters:

- fname: The path to the file you want to save, including the extension. If format is not set, the file extension will be used to infer the format instead.
- dpi: The resolution in dots per inch for your figure.
- format: The file format you'd like to use. Supported filetypes include svg, jpg, eps, and pdf.

Let's save our split violin plot from the previous section.
# Use catplot to make our violin plot
snsPlot = sns.catplot(data = mean_Embryo_data,
x = "wormStrain", y = "numEmbryos",
kind = "violin", # Make a violin plot
height = 6, aspect = 2,
hue = "doseLevel",
split = True, # This will create hybrid violin plots
inner = "quartile", # Add quartile markers to each half of the violin
palette = {"Mock":"green", "Medium": "yellow"}
)
# Overlay a swarmplot on our catplot
snsPlot.map_dataframe(sns.stripplot, data = mean_Embryo_data,
x = "wormStrain",
y = "numEmbryos",
hue = "doseLevel", # split the points by dose level just like nested boxplots
palette="dark:black", # Recolour all of the points to black
size=5, # Set the size of our points so they can be seen
dodge = True
)
# Change our y-axis label. It appears to work best AFTER the map_dataframe() call
snsPlot.set_axis_labels("dose level", "embryos per animal")
# Alter the xtick attributes
snsPlot.set_xticklabels(rotation = 90)
# Save the plot
plt.savefig("data/mean_embryos_byStrain.png",
format = "png",
dpi = 300
)
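As the parameter list above notes, savefig() can infer the format from the file extension when format is omitted. A standalone sketch writing to a temporary path (not part of the course data):

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [1, 4, 9])

# No format= argument: the ".png" extension determines the file type
out_path = os.path.join(tempfile.gettempdir(), "demo_plot.png")
plt.savefig(out_path, dpi = 150)

print(os.path.exists(out_path))  # True
```

Either style works; passing format explicitly, as we did above, simply makes the intent unambiguous.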
Up until now we've been generating a combination of either faceted plots or simply layering elements upon single plots. Throughout all of these we have not really been mixing the types of plots we've generated. Luckily for us, the matplotlib.pyplot module provides a means for us to put together multiple plot axes in a single figure.
We have already seen some of these figure-level functions in action with relplot() and catplot(), which provide an interface to axes-level methods like scatterplot() or boxplot() when creating faceted plots.

[Figure: an overview of figure-level vs axes-level plotting functions, from the seaborn tutorial at https://seaborn.pydata.org/tutorial/function_overview.html]
What if, however, we would like to create a figure with multiple axes of different types?
subplot2grid() to generate a multi-grid plot¶When generally considering the layout of our data, we want to think of breaking up the figure into a grid. This can start off simply as a 1x1 panel and expand outwards with nested panels within a 2x2 or 3x3 or larger figure. The dimensions also are not limited to square shapes but can be rectangular as well.
The subplot2grid() function takes the following parameters:

- shape: The dimensions of the figure given as a (numRow, numCol) tuple. This is essentially the backdrop of all the panels.
- loc: The location of the subplot (Axes object) you are creating in relation to the base figure. Position (0, 0) is the top left corner.
- rowspan: The dimensions of the subplot in number of rows.
- colspan: The dimensions of the subplot in number of columns.
- fig: A figure object to place the Axes object in. Otherwise the current figure is used.

[Figure: some simple layouts demonstrating how a figure can be subdivided]
Let's make a figure with 3 plots as seen in the 4th example above. We'll begin just by generating the specific layout.
# Initialize a figure
fig = plt.figure()
# Generate 3 subplots onto our current figure
ax1 = plt.subplot2grid(shape = (2, 2), loc = (0,0), colspan = 1, rowspan = 2)
ax2 = plt.subplot2grid(shape = (2, 2), loc = (0,1), colspan=1)
ax3 = plt.subplot2grid(shape = (2, 2), loc = (1,1), colspan=1)
plt.show()
Axes objects as a canvas to plot onto fig¶Now you can see that our figure encompasses the 3 panels that we envisioned. You'll notice that we named each panel with a reference/variable. We could have also added them to a single list object to save on variable names but to simplify our understanding they've been named separately.
Why do we need these objects? Each axes-level function we use to plot with can take an ax parameter where we pass along an Axes object. This identifies which panel we want to plot onto in our overall figure; otherwise it will use the last Axes object generated (i.e. ax3). We'll fill our Axes as follows:

- ax1: A scatterplot of our infection signal data
- ax2: Our split-violin plot of the mean embryo values across strains
- ax3: A KDE plot of the uninfected embryo counts per strain

Each of these is a matplotlib.axes.Axes object.
There is, however, a quick hitch. We can no longer rely directly on the figure-level functions of relplot, catplot and displot. In order to plot correctly on these subpanels, we'll need to use the axes-level function calls instead. We'll add our plots one at a time so you can see the effect of each.
# Initialize a figure
fig = plt.figure(figsize=(15, 15))
# Generate 3 subplots onto our current figure
ax1 = plt.subplot2grid((2, 3), (0, 0), colspan=1, rowspan=2)
ax2 = plt.subplot2grid((2, 3), (0, 1), colspan=2)
ax3 = plt.subplot2grid((2, 3), (1, 1), colspan=2)
# ---------- Plot 1 ----------#
# Add the first plot - scatterplot
# We'll build a scatterplot
sns.scatterplot(data = infectionSig_data,
x = "area",
y = "area.infected",
hue = "strain", # Set the point-colour by strain
alpha = 0.6, # Set the transparency of the points
ax = ax1
)
# Directly set the labels on the x-axis and y-axis
ax1.set_xlabel("total area")
ax1.set_ylabel("area infected")
# Show the plot
plt.show()
Axes¶Next we'll populate the second panel (top-right) with a combination of violinplot() and swarmplot(). We'll adjust the x-axis tick labels through the .tick_params() method as we will be dealing with the matplotlib.axes.Axes object.
# Initialize a figure
fig = plt.figure(figsize=(15, 15))
# Generate 3 subplots onto our current figure
ax1 = plt.subplot2grid((2, 3), (0, 0), colspan=1, rowspan=2)
ax2 = plt.subplot2grid((2, 3), (0, 1), colspan=2)
ax3 = plt.subplot2grid((2, 3), (1, 1), colspan=2)
# ---------- Plot 1 ----------#
# Add the first plot - scatterplot
# We'll build a scatterplot
sns.scatterplot(data = infectionSig_data,
x = "area",
y = "area.infected",
hue = "strain", # Set the point-colour by strain
alpha = 0.6, # Set the transparency of the points
ax = ax1
)
# Directly set the labels on the x-axis and y-axis of ax1
ax1.set_xlabel("total area")
ax1.set_ylabel("area infected")
# ---------- Plot 2 ----------#
# Add the second plot - violin plot
# Use violinplot to build our split violin plot
sns.violinplot(data = mean_Embryo_data,
x = "wormStrain", y = "numEmbryos",
hue = "doseLevel",
split = True, # Split each violin in half by hue level
inner = "quartile", # Add quartile markers to each half of the violin
palette = {"Mock":"green", "Medium": "yellow"},
alpha = 0.6,
ax = ax2
)
# Overlay a stripplot on our violin plot
sns.stripplot(data = mean_Embryo_data,
x = "wormStrain",
y = "numEmbryos",
hue = "doseLevel", # split the points by dose level just like nested boxplots
palette="dark:black", # Recolour all of the points to black
size=5, # Set the size of our points so they can be seen
dodge = True,
ax = ax2
)
# Directly set the labels on the x-axis and y-axis of ax2
ax2.set_xlabel("worm strain")
ax2.set_ylabel("embryos per animal")
# Alter the xtick attributes
ax2.tick_params(axis = 'x', rotation = 90)
# Show the plot
plt.show()
Axes methods

So you can see there's a slight issue above with our legend in the violin plot. Because we plot both a violin and a strip plot together, we get a legend containing both the colouring and the points. The points from the strip plot don't add much meaning, so we can remove them easily in the call to sns.stripplot() by using the legend = False parameter.
If you wanted to remove the legend altogether, you could use the .get_legend().remove() method chain. This accesses the Axes object's legend and removes it.
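A minimal sketch of that method chain on a bare matplotlib Axes:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1], label="line")
ax.legend()
# .get_legend() fetches the Axes' Legend object; .remove() detaches it,
# after which .get_legend() returns None
ax.get_legend().remove()
```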
# Initialize a figure
fig = plt.figure(figsize=(15, 15))
# Generate 3 subplots onto our current figure
ax1 = plt.subplot2grid((2, 3), (0, 0), colspan=1, rowspan=2)
ax2 = plt.subplot2grid((2, 3), (0, 1), colspan=2)
ax3 = plt.subplot2grid((2, 3), (1, 1), colspan=2)
# ---------- Plot 1 ----------#
# Add the first plot - scatterplot
# We'll build a scatterplot
sns.scatterplot(data = infectionSig_data,
x = "area",
y = "area.infected",
hue = "strain", # Set the point-colour by strain
alpha = 0.6, # Set the transparency of the points
ax = ax1
)
# Directly set the labels on the x-axis and y-axis of ax1
ax1.set_xlabel("total area")
ax1.set_ylabel("area infected")
# ---------- Plot 2 ----------#
# Add the second plot - violin plot
# Use violinplot to build our split violin plot
sns.violinplot(data = mean_Embryo_data,
x = "wormStrain", y = "numEmbryos",
hue = "doseLevel",
split = True, # Split each violin in half by hue level
inner = "quartile", # Add quartile markers to each half of the violin
palette = {"Mock":"green", "Medium": "yellow"},
alpha = 0.6,
ax = ax2
)
# Overlay a stripplot on our violin plot
sns.stripplot(data = mean_Embryo_data,
x = "wormStrain",
y = "numEmbryos",
hue = "doseLevel", # split the points by dose level just like nested boxplots
palette="dark:black", # Recolour all of the points to black
size=5, # Set the size of our points so they can be seen
dodge = True,
legend = False, ## Set the legend to false to remove it from the plot
ax = ax2
)
# Directly set the labels on the x-axis and y-axis of ax2
ax2.set_xlabel("worm strain")
ax2.set_ylabel("embryos per animal")
# Alter the xtick attributes
ax2.tick_params(axis = 'x', rotation = 90)
# Show the plot
plt.show()
Let's complete the set by adding a KDE plot to our final panel. We'll filter our dataset on the fly as we pass it to the kdeplot() function.
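Filtering "on the fly" just means building a boolean mask inline. A toy sketch with made-up values (the real N2_mock_data columns may differ):

```python
import pandas as pd

# Hypothetical stand-in data; column names mirror the course datasets
df = pd.DataFrame({
    "wormStrain": ["N2", "N2", "CB4856"],
    "doseLevel":  ["Mock", "Medium", "Mock"],
    "numEmbryos": [250, 180, 300],
})
# Boolean-mask the rows; the same expression can be written directly inside
# the data= argument of sns.kdeplot()
n2_mock = df[(df["wormStrain"] == "N2") & (df["doseLevel"] == "Mock")]
print(len(n2_mock))  # 1 row satisfies both conditions
```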
# Initialize a figure
fig = plt.figure(figsize=(15, 15))
# Generate 3 subplots onto our current figure
ax1 = plt.subplot2grid((2, 3), (0, 0), colspan=1, rowspan=2)
ax2 = plt.subplot2grid((2, 3), (0, 1), colspan=2)
ax3 = plt.subplot2grid((2, 3), (1, 1), colspan=2)
# ---------- Plot 1 ----------#
# Add the first plot - scatterplot
# We'll build a scatterplot
sns.scatterplot(data = infectionSig_data,
x = "area",
y = "area.infected",
hue = "strain", # Set the point-colour by strain
alpha = 0.6, # Set the transparency of the points
ax = ax1
)
# Directly set the labels on the x-axis and y-axis of ax1
ax1.set_xlabel("total area")
ax1.set_ylabel("area infected")
# ---------- Plot 2 ----------#
# Add the second plot - violin plot
# Use violinplot to build our split violin plot
sns.violinplot(data = mean_Embryo_data,
x = "wormStrain", y = "numEmbryos",
hue = "doseLevel",
split = True, # Split each violin in half by hue level
inner = "quartile", # Add quartile markers to each half of the violin
palette = {"Mock":"green", "Medium": "yellow"},
alpha = 0.6,
ax = ax2
)
# Overlay a stripplot on our violin plot
sns.stripplot(data = mean_Embryo_data,
x = "wormStrain",
y = "numEmbryos",
hue = "doseLevel", # split the points by dose level just like nested boxplots
palette="dark:black", # Recolour all of the points to black
size=5, # Set the size of our points so they can be seen
dodge = True,
legend = False, ## Set the legend to false to remove it from the plot
ax = ax2
)
# Directly set the labels on the x-axis and y-axis of ax2
ax2.set_xlabel("worm strain")
ax2.set_ylabel("embryos per animal")
# Alter the xtick attributes
ax2.tick_params(axis = 'x', rotation = 90)
# ---------- Plot 3 ----------#
# Add the third plot - KDE plot
sns.kdeplot(data = N2_mock_data,
x = "numEmbryos",
hue = "date",
fill = True, # Fill the area under each density curve
ax = ax3 # Explicitly target the bottom-right panel
)
# Directly set the x-axis label
ax3.set_xlabel("mean embryos per animal")
# Show the plot
plt.show()
Axes using tight_layout()

Nearly there! We can see some overlapping text where the y-axis labels of the KDE plot run into the scatter plot beside them; the width of the y-tick text pushes the axis label too far left. We can ask matplotlib to fix these spacing issues with plt.tight_layout(), which resolves the overlaps as best it can.
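A minimal sketch of tight_layout() on a bare two-panel figure (toy labels, no seaborn needed):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.set_ylabel("a deliberately long y-axis label")
# tight_layout() recomputes the figure margins and inter-panel spacing
# from the rendered sizes of titles, axis labels, and tick text
plt.tight_layout()
```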
# Initialize a figure
fig = plt.figure(figsize=(15, 15))
# Generate 3 subplots onto our current figure
ax1 = plt.subplot2grid((2, 3), (0, 0), colspan=1, rowspan=2)
ax2 = plt.subplot2grid((2, 3), (0, 1), colspan=2)
ax3 = plt.subplot2grid((2, 3), (1, 1), colspan=2)
# ---------- Plot 1 ----------#
# Add the first plot - scatterplot
# We'll build a scatterplot
sns.scatterplot(data = infectionSig_data,
x = "area",
y = "area.infected",
hue = "strain", # Set the point-colour by strain
alpha = 0.6, # Set the transparency of the points
ax = ax1
)
# Directly set the labels on the x-axis and y-axis of ax1
ax1.set_xlabel("total area")
ax1.set_ylabel("area infected")
# ---------- Plot 2 ----------#
# Add the second plot - violin plot
# Use violinplot to build our split violin plot
sns.violinplot(data = mean_Embryo_data,
x = "wormStrain", y = "numEmbryos",
hue = "doseLevel",
split = True, # Split each violin in half by hue level
inner = "quartile", # Add quartile markers to each half of the violin
palette = {"Mock":"green", "Medium": "yellow"},
alpha = 0.6,
ax = ax2
)
# Overlay a stripplot on our violin plot
sns.stripplot(data = mean_Embryo_data,
x = "wormStrain",
y = "numEmbryos",
hue = "doseLevel", # split the points by dose level just like nested boxplots
palette="dark:black", # Recolour all of the points to black
size=5, # Set the size of our points so they can be seen
dodge = True,
legend = False, ## Set the legend to false to remove it from the plot
ax = ax2
)
# Directly set the labels on the x-axis and y-axis of ax2
ax2.set_xlabel("worm strain")
ax2.set_ylabel("embryos per animal")
# Alter the xtick attributes
ax2.tick_params(axis = 'x', rotation = 90)
# ---------- Plot 3 ----------#
# Add the third plot - KDE plot
sns.kdeplot(data = N2_mock_data,
x = "numEmbryos",
hue = "date",
fill = True, # Fill the area under each density curve
ax = ax3 # Explicitly target the bottom-right panel
)
# Directly set the x-axis label
ax3.set_xlabel("mean embryos per animal")
### Re-adjust the axes to remove overlap due to axis text
plt.tight_layout()
# Show the plot
plt.show()
Not too shabby! And we're done!
That's our fourth class on Python! You've made it through, and we've learned about taking advantage of built-in DataFrame methods for exploratory data analysis as well as how to finally visualize some of your data:
groupby() and aggregation functions
the matplotlib.pyplot module
the seaborn package
customizing seaborn figures through the pyplot package

At the end of this lecture a Quercus assignment portal will be available for submitting your completed skeletons from today (including the comprehension question answers!). These will be due one week later, before the next lecture. Each lecture skeleton is worth 2% of your final grade, but a bonus 0.7% will be awarded for submissions made within 24 hours of the end of lecture (i.e., 1300 hours the following day).
Soon after the end of each lecture, a homework assignment will be available for you in DataCamp. Your assignment is to complete the Introduction to Data Visualization with Seaborn course (4 chapters, 3700 possible points). This is a pass-fail assignment, and in order to pass you need to achieve at least 2775 points (75% of the total possible points). Note that taking hints in a DataCamp chapter will reduce your total earned points for that chapter.
In order to properly assess your progress on DataCamp, please print a PDF of the summary at the end of each chapter. Navigate to the Learn section along the top menu bar of DataCamp; this brings you to the various courses you have been assigned under My Assignments. Click on your completed assignment and expand each chapter of the course by clicking on the VIEW CHAPTER DETAILS link. Do this for all sections on the page, then select all on the page (i.e., Ctrl + A). Print the page from your browser menu and save it as a single PDF. If you don't select all first (at least in Google Chrome), you may not be able to print the full page.
Submit the file(s) for the homework to the assignment section of Quercus. This allows us to keep track of your progress while also producing a standardized way for you to check on your assignment "grades" throughout the course.
You will have until 09:59 hours on Tuesday, February 7th to submit your assignment (right before the next lecture).
[Image: A sample screen shot for one of the DataCamp assignments. You'll want to try and print off a single PDF of this section from Learn > My Assignments.]
Revision 1.0.0: materials prepared for CSB1021H S LEC0140, 01-2022 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.2.0: edited and prepared for CSB1021H S LEC0140, 01-2023 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
If your kernel crashes, you can use this code cell to recreate all of the data used in section 5.0.0. Convert the cell below into a code cell by pressing the "Y" key while it is highlighted in Command mode, then simply run the cell to recreate the datasets.
The Centre for the Analysis of Genome Evolution and Function (CAGEF) at the University of Toronto offers comprehensive experimental design, research, and analysis services in microbiome and metagenomic studies, genomics, proteomics, and bioinformatics.
From targeted DNA amplicon sequencing to transcriptomes, whole genomes, and metagenomes, from protein identification to post-translational modification, CAGEF has the tools and knowledge to support your research. Our state-of-the-art facility and experienced research staff provide a broad range of services, including both standard analyses and techniques developed by our team. In particular, we have special expertise in microbial, plant, and environmental systems.
For more information about us and the services we offer, please visit https://www.cagef.utoronto.ca/.
